WO2023016267A1 - Spam comment identification method and apparatus, and device and medium - Google Patents

Spam comment identification method and apparatus, and device and medium

Info

Publication number
WO2023016267A1
Authority
WO
WIPO (PCT)
Prior art keywords
comments
identified
comment
subject
similarity
Prior art date
Application number
PCT/CN2022/108563
Other languages
French (fr)
Chinese (zh)
Inventor
邓冰娜
谢永恒
火一莽
郭子剑
Original Assignee
北京锐安科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京锐安科技有限公司
Publication of WO2023016267A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G06F16/3347 Query execution using vector based model
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • The embodiments of the present application relate to the technical field of big data mining, for example, to a method, apparatus, device and medium for identifying spam comments.
  • Methods for preventing and identifying spam comments fall mainly into two categories: manual identification methods and automatic identification methods.
  • Automatic identification methods can be further divided into classification methods based on a training set and identification methods based on similarity.
  • The manual identification method can only identify newly published comments and filter out the spam among them; it can do nothing about spam comments that have already been published. It also requires continuous manual maintenance, which is inconvenient. Moreover, spammers can use a variety of proxy methods to cheat the filtering mechanism.
  • For the classification method based on a training set, because the network makes commenting convenient, comments are updated quickly and the feature words change considerably. For the classifier to identify spam comments accurately, the training samples must change along with them. Whenever the training samples change, the feature items must be reselected and their weights recalculated and extracted, which seriously reduces the operating efficiency of the system.
  • The embodiments of the present application provide a method, apparatus, device and medium for identifying spam comments, so as to realize automatic identification of spam comments in Internet comment information.
  • the embodiment of the present application provides a method for identifying spam comments, including:
  • if it is determined that comments that have not been successfully identified remain among the plurality of comments to be identified, at least one of the following operations is performed on the subject term set according to the identified normal comments, to obtain a new subject term set as the subject term set: performing subject term expansion on the subject term set, and updating the calculation weights of the plurality of subject terms in the subject term set; and the plurality of comments to be identified are updated to the comments to be identified that have not been successfully identified;
  • the embodiment of the present application also provides a spam comment identification device, the device includes:
  • a similarity calculation module, configured to obtain a plurality of comments to be identified and a subject term set corresponding to a target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set;
  • a comment identification module, configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result;
  • a subject term set updating module, configured to, if it is determined that comments that have not been successfully identified remain among the plurality of comments to be identified, perform at least one of the following operations on the subject term set according to the identified normal comments, to obtain a new subject term set as the subject term set: performing subject term expansion on the subject term set, and updating the calculation weights of the plurality of subject terms in the subject term set; and to update the plurality of comments to be identified to the comments to be identified that have not been successfully identified;
  • a comment success identification module, configured to return to calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all the comments to be identified are successfully identified.
  • the embodiment of the present application also provides a computer device, and the computer device includes:
  • one or more processors; and
  • a storage device configured to store one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method for identifying spam comments described in any embodiment of the present application.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, wherein when the program is executed by a processor, the method for identifying spam comments as described in any embodiment of the present application is implemented.
  • Fig. 1 is a flowchart of a method for identifying spam comments in Embodiment 1 of the present application
  • Fig. 2 is a flowchart of a method for identifying spam comments in Embodiment 2 of the present application
  • Fig. 3a is a flow chart of a method for identifying spam comments in Embodiment 3 of the present application.
  • Fig. 3b is an overall block diagram of a spam comment identification method in Embodiment 3 of the present application.
  • Fig. 4 is a schematic structural diagram of an identification device for spam comments in Embodiment 4 of the present application.
  • FIG. 5 is a schematic structural diagram of a computer device in Embodiment 5 of the present application.
  • Fig. 1 is the flowchart of the method for identifying spam comments provided by Embodiment 1 of the present application.
  • This embodiment is applicable to the situation of identifying spam comments in Internet comment information, and the method can be executed by an identification device for spam comments.
  • the device can be implemented in the form of hardware and/or software, and can generally be integrated into a computer device with a function of identifying spam comments, for example, a terminal device or a server, etc.
  • the method specifically includes the following steps:
  • the comments to be identified refer to Internet comment information that corresponds to the target article and needs to be identified.
  • the subject term set refers to a term set composed of a plurality of subject terms corresponding to the target article.
  • the calculation weight of each subject term can be calculated according to the formula 1+log10(1+n), wherein, n represents the number of times the subject term appears in the target article.
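  • As an illustration, the weight formula 1+log10(1+n) can be sketched in Python as follows. The function name `initial_weights` and the pre-tokenized input are assumptions made for this example; the text does not say how the target article is segmented into words.

```python
import math
from collections import Counter


def initial_weights(article_tokens, subject_terms):
    """Weight each subject term as 1 + log10(1 + n),
    where n is the number of times the term occurs in the target article."""
    counts = Counter(article_tokens)
    return {term: 1 + math.log10(1 + counts[term]) for term in subject_terms}


# Hypothetical tokenized article and subject term set.
weights = initial_weights(
    ["spam", "filter", "spam", "comment"],
    ["spam", "comment", "topic"],
)
```

  • A term that never occurs in the article (n = 0) still receives a weight of 1, so no subject term is zeroed out of the later similarity calculation.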
  • Spam comments are comments with low similarity to the target article, that is, comments that are not strongly related to it. Candidate spam comments are comments initially judged to have low similarity to the target article; their category must be confirmed in a later step. Normal comments are comments with high similarity to the target article, that is, comments strongly related to it.
  • multiple to-be-recognized comments can be classified to distinguish candidate spam comments from normal comments.
  • For example, comments to be identified whose similarity calculation results are greater than or equal to a preset threshold (for example, 90%) can be directly determined as normal comments; comments to be identified whose similarity calculation results are less than or equal to a preset threshold (for example, 5%) are directly determined as candidate spam comments; and comments to be identified whose similarity calculation results fall within a preset range (for example, 5%-90%) are determined as comments to be identified that have not been successfully identified.
  • the candidate spam comments may be directly determined as spam comments, or a secondary screening may be performed on the candidate spam comments, which is not limited in this embodiment.
  • Comments to be identified that have not been successfully identified form a third category: comments that are neither candidate spam comments nor normal comments.
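  • The three-way split can be sketched as a small Python function. The 5% and 90% defaults are the example values mentioned in the text; the function name and the label strings are assumptions made for this illustration.

```python
def classify(similarity, spam_max=0.05, normal_min=0.90):
    """Map a similarity score in [0, 1] to one of three categories."""
    if similarity >= normal_min:
        return "normal"
    if similarity <= spam_max:
        return "candidate_spam"
    return "unidentified"  # left for a later round


labels = [classify(s) for s in (0.95, 0.03, 0.40)]
```

  • Scores strictly between the two thresholds are deliberately left undecided, which is what makes the later rounds of subject term updating necessary.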
  • To expand the subject term set refers to adding newly selected subject terms to the subject term set.
  • The selection rules for new subject terms can be set according to actual needs. For example, synonyms of the subject terms in the target article can be used as new subject terms; this is not limited in this embodiment.
  • The comments to be identified in this step are the comments that have not yet been successfully identified.
  • Using the weights of all subject terms in the expanded subject term set, including the newly added ones, the similarity between each unsuccessfully identified comment and the target article can be calculated again, so as to complete the identification of all such comments. If some comments still cannot be identified after the subject term set has been expanded once, the set is expanded again, until all the comments to be identified are successfully identified.
  • The technical solution of this embodiment calculates the similarity between each comment to be identified and the target article using the calculation weight of each subject term in the subject term set, and identifies candidate spam comments and normal comments among the comments to be identified according to the similarity calculation results. If it is determined that some comments remain unidentified, the subject term set is updated and those comments are identified again, until the category of every comment to be identified has been determined. In this way, multiple rounds of automatic identification of spam comments in Internet comment information are realized, and the identification effect is improved.
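  • The multi-round procedure can be sketched as a loop. Everything concrete here is an assumption made for illustration: `toy_similarity` and `expand_terms` are simple stand-ins rather than the similarity formula or expansion rules of this application, the thresholds are arbitrary, and the `max_rounds` guard is added only so that the sketch always terminates.

```python
def classify(similarity, spam_max=0.2, normal_min=0.6):
    """Three-way split; both thresholds are illustrative assumptions."""
    if similarity >= normal_min:
        return "normal"
    if similarity <= spam_max:
        return "candidate_spam"
    return "unidentified"


def toy_similarity(comment, weights):
    """Toy stand-in: share of total term weight whose terms occur in the comment."""
    tokens = set(comment.split())
    total = sum(weights.values())
    hit = sum(w for term, w in weights.items() if term in tokens)
    return hit / total if total else 0.0


def expand_terms(weights, normal_comments):
    """Toy expansion: every word of a normal comment becomes a term of weight 1.0."""
    updated = dict(weights)
    for comment in normal_comments:
        for token in comment.split():
            updated.setdefault(token, 1.0)
    return updated


def identify(comments, weights, max_rounds=10):
    pending, candidate_spam, normal = list(comments), [], []
    for _ in range(max_rounds):  # termination guard (an assumption)
        still_pending = []
        for comment in pending:
            label = classify(toy_similarity(comment, weights))
            if label == "normal":
                normal.append(comment)
            elif label == "candidate_spam":
                candidate_spam.append(comment)
            else:
                still_pending.append(comment)
        pending = still_pending
        if not pending:
            break
        # Update the subject term set from the normal comments found so far.
        weights = expand_terms(weights, normal)
    return candidate_spam, normal, pending


spam, normal, pending = identify(
    ["great phone battery", "buy cheap watches", "battery lasts great"],
    {"battery": 1.0, "phone": 1.0},
)
```

  • In this toy run the third comment scores 0.5 in the first round and stays undecided; after "great" is added from the first normal comment, it scores 2/3 in the second round and is classified as normal.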
  • Fig. 2 is the flow chart of the method for identifying spam comments provided by the second embodiment of the present application.
  • This embodiment refines the above-mentioned embodiment. Obtaining a plurality of comments to be identified corresponding to the target article includes: obtaining all the comments corresponding to the target article and matching each comment against a lexicon of common Internet words; and, according to the matching results, obtaining candidate spam comments, candidate normal comments and unidentifiable comments, and determining the candidate normal comments and the unidentifiable comments as the plurality of comments to be identified.
  • There are multiple identified candidate spam comments. After all the comments to be identified have been successfully identified, the method further includes: filtering the identified candidate spam comments and, according to the filtering results, identifying each candidate spam comment as a spam comment or a normal comment.
  • the method includes the following steps:
  • Common Internet words are conventional characters, words or phrases that appear frequently on the Internet, for example, top, refueling, support, sofa, boredom, soy sauce, occupying a seat, and pouring water; the lexicon of common Internet words is a lexicon containing such words.
  • The length L of each comment is calculated, and a threshold T is set to evaluate the length of the comment, for example, with T between 5 and 8. When L < T, the comment is a short comment, and the set of short comments is defined as ShorD; when L is greater than or equal to T, the comment is a non-short comment.
  • Short comments containing common Internet words basically contain no words related to the content of the target article, so identifying their category by text similarity is bound to perform poorly. Therefore, in this embodiment, the comments are first screened by length: short comments are identified using the lexicon of common Internet words, and the comments that cannot be identified this way are then identified by text similarity, so that spam comments can be identified whether they are short or not.
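  • A minimal sketch of the length-based pre-screening follows. The sample lexicon, the measurement of length in whitespace-separated tokens, and the function name `prescreen` are assumptions for illustration; the text does not fix the unit of L or which category a lexicon match is assigned to, so the sketch only separates matched short comments from the rest.

```python
COMMON_WORDS = {"top", "sofa", "support"}  # hypothetical sample lexicon


def prescreen(comments, lexicon, threshold=6):
    """Separate short comments (L < T) that hit the lexicon from all other comments."""
    matched, remaining = [], []
    for comment in comments:
        tokens = comment.lower().split()
        is_short = len(tokens) < threshold  # L < T
        if is_short and any(tok in lexicon for tok in tokens):
            matched.append(comment)
        else:
            remaining.append(comment)  # passed on to text-similarity identification
    return matched, remaining


matched, remaining = prescreen(
    [
        "sofa",
        "top top",
        "nice phone",
        "the battery life in this review matches my experience",
    ],
    COMMON_WORDS,
)
```

  • Only the first two comments are handled by the lexicon; the other two go on to the similarity-based rounds.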
  • S260 Go back and execute calculating the similarity between each of the plurality of unrecognized comments and the target article according to the calculation weight of each keyword in the keyword set, until all the unrecognized comments are successfully identified.
  • In the technical solution of this embodiment, candidate spam comments, candidate normal comments and unidentifiable comments are obtained by matching against the lexicon of common Internet words; the candidate normal comments and the unidentifiable comments are determined as the comments to be identified and are identified by text similarity calculation; and after all the comments to be identified have been successfully identified, secondary filtering is performed on the identified candidate spam comments. In this way, spam comments can be identified among both short and non-short comments, automatic identification of spam comments in Internet comment information is realized, and the identification effect is improved.
  • Secondary filtering uses common Internet words and subject terms to filter the candidate spam comments again, by comparing the proportion of normal words and the proportion of spam words in the total vocabulary of each candidate spam comment.
  • When the proportion of normal words in the total vocabulary of a candidate spam comment is greater than or equal to a threshold, the comment is considered a normal comment; when that proportion is less than the threshold, the comment is considered a spam comment. This reduces the possibility of normal comments being identified as spam comments.
  • the threshold can be set according to actual requirements, which is not limited in this embodiment.
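  • The secondary filtering rule can be sketched as follows. Treating "normal words" as the union of common Internet words and subject terms, the whitespace tokenization, and the default threshold of 0.5 are all assumptions made for this example.

```python
def secondary_filter(candidates, normal_vocab, threshold=0.5):
    """Re-check candidate spam comments by their share of normal words."""
    spam, normal = [], []
    for comment in candidates:
        tokens = comment.split()
        ratio = sum(tok in normal_vocab for tok in tokens) / len(tokens) if tokens else 0.0
        if ratio >= threshold:
            normal.append(comment)  # rescued from the candidate spam set
        else:
            spam.append(comment)
    return spam, normal


# Hypothetical vocabulary: common Internet words plus subject terms.
vocab = {"support", "battery", "phone"}
spam, rescued = secondary_filter(["support battery", "buy cheap watches now"], vocab)
```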
  • Fig. 3a is a flow chart of the spam comment identification method provided by the third embodiment of the present application
  • Fig. 3b is an overall block diagram of the spam comment identification method provided by the third embodiment of the present application. This embodiment is based on the above-mentioned embodiment.
  • Identifying the candidate spam comments and normal comments among the plurality of comments to be identified includes: obtaining the similarity calculation result corresponding to each comment to be identified; if the similarity calculation result corresponding to a comment to be identified is less than or equal to the first threshold, determining that the comment is a candidate spam comment; if the result is greater than or equal to the second threshold, determining that the comment is a normal comment; and if the result is greater than the first threshold and less than the second threshold, determining that the comment has not been successfully identified.
  • the method includes the following steps:
  • The calculation weight of each subject term includes the weight of the subject term in the target article. Calculating the similarity between each comment to be identified and the target article P, according to the calculation weight of each of the plurality of subject terms in the subject term set, includes:
  • C_k represents the vector of the kth comment to be identified;
  • P represents the vector of the target article;
  • n is the dimension of the vector;
  • w_i represents the weight of subject term i in the target article;
  • w_ik represents the weight of subject term i in the kth comment to be identified;
  • S_i represents the semantic information between words; in the first round of similarity calculation between the comments to be identified and the target article, S_i is 1, and in the remaining rounds
  • Sim(P_i, C_i'k) indicates the similarity score between subject term i in the target article and its synonym i' in the kth comment to be identified; if they are the same word, the value is 1;
  • LenP is the number of subject terms in the target article;
  • Same(P, C_k) is the number of subject terms of the target article, or synonyms of those subject terms, that appear in the kth comment to be identified. Because the value is a number not greater
  • n is the dimension of the vector C_k of the kth comment to be identified and of the vector P of the target article, and is numerically equal to the number of subject terms.
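  • The symbols defined above suggest a weighted cosine similarity over the subject terms. The exact formula of this application is not reproduced in this text, so the following Python sketch is only an assumed form: a standard cosine similarity whose numerator terms are scaled by a per-term factor S_i, with S_i defaulting to 1 as in the first round. The function name and the dictionary-based vectors are also assumptions.

```python
import math


def weighted_cosine(article_w, comment_w, semantic=None):
    """Cosine similarity between article weights w_i and comment weights w_ik.

    `semantic` optionally maps a term to its factor S_i (default 1.0).
    Placing S_i in the numerator is an assumption, not the documented formula.
    """
    s = semantic or {}
    numerator = sum(
        w_i * comment_w.get(term, 0.0) * s.get(term, 1.0)
        for term, w_i in article_w.items()
    )
    norm_p = math.sqrt(sum(w_i ** 2 for w_i in article_w.values()))
    norm_c = math.sqrt(sum(comment_w.get(term, 0.0) ** 2 for term in article_w))
    if norm_p == 0 or norm_c == 0:
        return 0.0  # no shared subject terms, or an empty term set
    return numerator / (norm_p * norm_c)


article = {"battery": 2.0, "phone": 1.0}
identical = weighted_cosine(article, {"battery": 2.0, "phone": 1.0})
partial = weighted_cosine(article, {"battery": 1.0})
unrelated = weighted_cosine(article, {"watches": 3.0})
```

  • The vectors are indexed by the subject terms only, so words of the comment that are not subject terms (such as "watches" above) do not contribute, consistent with n being equal to the number of subject terms.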
  • This embodiment proposes an improved cosine similarity formula, based on the similarity between words, word position information and morphological similarity, for calculating the similarity between the comment to be identified and the target article.
  • The improved formula is as follows:
  • the first threshold refers to a preset value used to evaluate the comment to be identified as a candidate spam comment.
  • the first threshold may be set according to specific actual requirements, which is not limited in this embodiment.
  • the second threshold refers to a preset value used to evaluate the comment to be identified as a normal comment.
  • the second threshold may be set according to specific actual requirements, which is not limited in this embodiment.
  • For example, the first threshold is set to a value b and the second threshold is set to a value a. When the similarity calculation result corresponding to a comment to be identified is less than or equal to b, the comment is determined to be a candidate spam comment; when the result is greater than or equal to a, the comment is determined to be a normal comment; when the result is greater than b and less than a, the comment is determined to be a comment that has not been successfully identified.
  • Expanding the subject term set includes: obtaining the high-frequency words included in the identified normal comments, adding them to the subject term set as new subject terms, and setting weights for the newly added subject terms; and counting the frequency with which each subject term in the new subject term set occurs in the target article, matching in a co-occurrence lexicon according to that frequency to obtain co-occurrence words associated with at least one subject term in the new subject term set, adding those co-occurrence words to the subject term set as new subject terms, and setting weights for them.
  • Weight'(t_r) is the adjusted weight of a high-frequency word; t_r is a word that appears in normal comments; T(t_r) is the weight of word t_r in normal comments, calculated as 1+log10(1+n_k); T(k) is the number of normal comments that include word t_r after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison.
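  • The quantities above can be computed as in the following Python sketch. The formula that combines them is not reproduced in this text, so the product T(t_r) × T(k)/N(k) used here is an assumption, as are the `min_count` cut-off that decides what counts as a high-frequency word and the whitespace tokenization.

```python
import math
from collections import Counter


def high_frequency_term_weights(normal_comments, min_count=2):
    """Assumed weighting: (1 + log10(1 + n_k)) * T(k) / N(k) for frequent words."""
    tokenized = [comment.split() for comment in normal_comments]
    total = len(tokenized)  # N(k): number of normal comments
    term_counts = Counter(tok for toks in tokenized for tok in toks)      # n_k
    doc_counts = Counter(tok for toks in tokenized for tok in set(toks))  # T(k)
    return {
        term: (1 + math.log10(1 + n_k)) * doc_counts[term] / total
        for term, n_k in term_counts.items()
        if n_k >= min_count
    }


new_terms = high_frequency_term_weights(
    ["battery life great", "battery drains fast", "great screen"]
)
```

  • Under this assumed form, a word that is frequent but concentrated in a single normal comment receives a lower weight than one spread across many normal comments.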
  • the calculation weights of multiple keywords in the keyword set are updated, including:
  • the calculated weight is updated for each subject term in the subject term set.
  • Weight'(i) is the updated calculation weight of word i;
  • word i is a subject term in the subject term set of the target article;
  • the synonym i' is a synonym of word i that appears in normal comments;
  • n_p is the number of times word i appears in the target article;
  • n_k is the number of times word i appears in normal comments after K rounds of similarity comparison;
  • T(k) is the number of normal comments that include word i after K rounds of similarity comparison;
  • N(k) is the total number of normal comments after K rounds of similarity comparison;
  • N_i' is the set of synonyms of word i;
  • Weight(i') is the weight of synonym i' of word i in all normal comments;
  • Sim(i, i') is the similarity score between word i and synonym i';
  • the adjustment factor is greater than 0 and adjusts the degree to which the weight and similarity of each synonym in the synonym set of word i influence word i.
  • 1+log(1+n_p+n_k) represents the word frequency of the word.
  • 1 is added inside the logarithm to prevent the logarithm from evaluating to zero: n_p may well be 0, and if n_k is then 1, log(n_p+n_k) is 0. In addition, since the value of the logarithm is generally less than 1, it would reduce the value of the whole formula and could adversely affect subsequent comment classification, so in this embodiment 1 is also added before the logarithm.
  • In the formula, T(k)/N(k) indicates the ratio of the number of normal comments in which word i appears to the total number of normal comments.
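  • The effect of the two added 1s can be checked numerically. The sketch below assumes a base-10 logarithm, matching the log10 used in the other weight formulas of this text; the function name is an assumption.

```python
import math


def frequency_term(n_p, n_k):
    """Word-frequency term 1 + log(1 + n_p + n_k), with log taken as base 10."""
    return 1 + math.log10(1 + n_p + n_k)


# Without the inner +1, a word absent from the article (n_p = 0) that
# appears once in normal comments (n_k = 1) would contribute log10(1) = 0.
bare = math.log10(0 + 1)
adjusted = frequency_term(0, 1)
```

  • With both adjustments the term is never below 1, so multiplying it by the ratio T(k)/N(k) cannot shrink a word's weight below that ratio.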
  • the calculation weight of the synonym i' of the term i may be adjusted according to the following formula, so as to realize the adjustment of the calculation weight of the subject term i.
  • i' is a synonym, appearing in the comments, of word i in the subject term set;
  • T(i') is the weight of word i' in normal comments, calculated as 1+log10(1+n_k);
  • T(k) is the number of normal comments containing word i' after K rounds of similarity comparison;
  • N(k) is the total number of normal comments after K rounds of similarity comparison;
  • T(i_p) is the weight of the synonym i_p of word i', where word i_p is a word in the subject term set before the weight adjustment;
  • N_p is the set of synonyms of word i', obtained from the subject term set of the target article.
  • the above technical solution can make the recognition result of the comment to be recognized more accurate and reliable by continuously adjusting the calculation weight of the subject words.
  • In the technical solution of this embodiment, candidate spam comments, candidate normal comments and unidentifiable comments are obtained by matching against the lexicon of common Internet words; the candidate normal comments and the unidentifiable comments are determined as the comments to be identified; with the first threshold and the second threshold set, the comments to be identified are identified by text similarity calculation; and after all the comments to be identified have been successfully identified, secondary filtering is performed on the candidate spam comments. In this way, spam comments can be identified among both short and non-short comments, automatic identification of spam comments in Internet comment information is realized, and the identification effect is improved.
  • FIG. 4 is a schematic structural diagram of an apparatus for identifying spam comments provided in Embodiment 4 of the present application.
  • the apparatus can implement a method for identifying spam comments involved in the above-mentioned embodiments.
  • This device can be implemented in the form of software and/or hardware.
  • the identification device of the spam comments includes: similarity calculation module 410, comment identification module 420, subject term set update module 430, comment success identification module 440.
  • The similarity calculation module 410 is configured to obtain a plurality of comments to be identified and a subject term set corresponding to the target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set. The comment identification module 420 is configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result. The subject term set update module 430 is configured to, if it is determined that comments that have not been successfully identified remain among the plurality of comments to be identified, perform at least one of the following operations on the subject term set according to the identified normal comments, to obtain a new subject term set as the subject term set: expanding the subject term set, and updating the calculation weights of the plurality of subject terms in the subject term set; and to update the plurality of comments to be identified to the comments to be identified that have not been successfully identified. The comment success identification module 440 is configured to return to calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all the comments to be identified are successfully identified.
  • The technical solution of this embodiment calculates the similarity between each comment to be identified and the target article using the calculation weight of each subject term in the subject term set, and identifies candidate spam comments and normal comments among the comments to be identified according to the similarity calculation result. If it is determined that some comments remain unidentified, the subject term set is updated and those comments are identified again, until the category of every comment to be identified has been determined. In this way, multiple rounds of automatic identification of spam comments in Internet comment information are realized, and the identification effect is improved.
  • The similarity calculation module 410 is configured to obtain all comments corresponding to the target article and match each comment against the lexicon of common Internet words; and, according to the matching results, to obtain candidate spam comments, candidate normal comments and unidentifiable comments, and determine the candidate normal comments and the unidentifiable comments as the plurality of comments to be identified.
  • There are multiple identified candidate spam comments. The device for identifying spam comments further includes a secondary filtering module, configured to filter the identified candidate spam comments after all the comments to be identified have been successfully identified, and to identify each candidate spam comment as a spam comment or a normal comment according to the filtering results.
  • The comment identification module 420 is configured to obtain the similarity calculation result corresponding to each comment to be identified; if the result corresponding to a comment is less than or equal to the first threshold, determine that the comment is a candidate spam comment; if the result is greater than or equal to the second threshold, determine that the comment is a normal comment; and if the result is greater than the first threshold and less than the second threshold, determine that the comment has not been successfully identified.
  • the calculation weight of each subject term includes: the weight of each subject term in the target article; the similarity calculation module 410 is set to use the formula Calculate the similarity between each comment to be identified and the target article; among them, C k represents the vector of the kth comment to be recognized, P represents the vector of the target article, n is the dimension of the vector, and w i represents the subject word i in The weight in the target article, w ik represents the weight of the topic word i in the k comment to be recognized, and S i represents the semantic information between words.
  • the subject term set update module 430 is configured to obtain the high-frequency terms included in the identified normal comments, add the high-frequency terms as new subject terms to the subject term set, and The newly added subject words carry out weight setting; Count the frequency of occurrence of each subject word included in the target article in the new subject words set, according to the frequency of occurrence, match the co-occurrence word thesaurus to obtain the new subject words The co-occurrence words associated with the frequency of occurrence of at least one subject term in the set are added to the subject term set as new subject terms, and weight settings are performed for the newly added subject terms.
  • The subject term set update module 430 is configured to update the calculation weight of each subject term in the subject term set using a formula in which: Weight'(i) is the updated calculation weight of word i, where word i is a subject term in the subject term set of the target article and i' is a synonym of word i that appears in the normal comments; n_p is the number of times word i appears in the target article; n_k is the number of times word i appears in the normal comments after k rounds of similarity comparison; T(k) is the number of normal comments containing word i after k rounds of similarity comparison; N(k) is the total number of normal comments after k rounds of similarity comparison; N_i' is the set of synonyms of word i; Weight(i') is the weight of synonym i' across all normal comments; Sim(i, i') is the similarity score between word i and synonym i'; and the adjustment factor is greater than 0.
  • the spam comment identification device provided in the embodiment of the present application can execute the spam comment identification method provided in any embodiment of the present application, and has corresponding functional modules for executing the method.
  • FIG. 5 is a schematic structural diagram of a computer device provided in Embodiment 5 of the present application.
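  • The computer device includes a processor 510, a memory 520, an input device 530, and an output device 540.
  • The number of processors 510 in the computer device may be one or more; one processor 510 is taken as an example in FIG. 5. The processor 510, the memory 520, the input device 530, and the output device 540 may be connected by a bus or in other ways; a bus connection is taken as an example in FIG. 5.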
  • the memory 520 can be used to store software programs, computer-executable programs and modules, such as program instructions/modules corresponding to the method for identifying spam comments in the embodiment of the present application (for example, the device for identifying spam comments).
  • the processor 510 executes various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 520 , that is, realizes the above-mentioned method for identifying spam comments.
  • The method includes: obtaining a plurality of comments to be identified and a subject term set corresponding to the target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set; identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result; if it is determined that comments that have not been successfully identified remain among the plurality of comments to be identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; updating the plurality of comments to be identified to be the comments that have not been successfully identified; and returning to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and
  • the memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal, and the like.
  • the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices.
  • the memory 520 may further include memory located remotely from the processor 510, and these remote memories may be connected to the computer device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • the input device 530 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the computer equipment.
  • the output device 540 may include a display device such as a display screen.
  • Embodiment 6 of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement a method for identifying spam comments. The method includes: obtaining a plurality of comments to be identified and a subject term set corresponding to the target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set; identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result; if it is determined that comments that have not been successfully identified remain among the plurality of comments to be identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; updating the plurality of comments to be identified to be the comments that have not been successfully identified; and returning to the step of, according to the calculation weight of each subject term in the subject term set, calculating
  • In the storage medium containing computer-executable instructions provided in the embodiments of the present application, the computer-executable instructions are not limited to the method operations described above, and can also perform related operations in the spam comment identification method provided in any embodiment of the present application.
  • The present application can be implemented by software together with the necessary general-purpose hardware, and it can of course also be implemented by hardware alone, but in many cases the former is the better implementation.
  • The technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk, or optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present application.

Abstract

Disclosed in the embodiments of the present application are a spam comment identification method and apparatus, and a device and a medium. The method comprises: acquiring a plurality of comments to be identified and a subject word set, which correspond to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each subject word among a plurality of subject words in the subject word set; identifying, according to a similarity calculation result, candidate spam comments and normal comments from the plurality of comments to be identified; when it is determined that there are comments to be identified, which have not been successfully identified, among the plurality of comments to be identified, and according to the identified normal comments, performing at least one of the following operations on the subject word set to obtain a new subject word set to serve as the subject word set: performing subject word augmentation on the subject word set, updating the calculation weights of the plurality of subject words in the subject word set, and updating the plurality of comments to be identified to the comments to be identified that have not been successfully identified; and returning to execute the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each of the subject words in the subject word set, until all the comments to be identified are successfully identified.

Description

Method, Apparatus, Device, and Medium for Identifying Spam Comments
This application claims priority to Chinese Patent Application No. 202110925078.4, filed with the China Patent Office on August 12, 2021, the entire content of which is incorporated herein by reference.
Technical Field
The embodiments of the present application relate to the technical field of big data mining, for example, to a method, apparatus, device, and medium for identifying spam comments.
Background
With the rapid development of Internet technology, the volume of comment information on the Internet is growing explosively. How to filter this comment information and identify spam comments has become an urgent problem.
In the related art, methods for blocking and identifying spam comments on the Internet fall into two main categories: manual identification and automatic identification. Automatic identification methods can be further divided into classification methods based on a training set and identification methods based on similarity.
However, manual identification can only examine newly posted comments and filter out the spam among them; it can do nothing about spam comments that have already been published. Manual identification also requires continuous manual maintenance, which is inconvenient, and spammers can use a variety of proxy methods to deceive the filtering mechanism. As for classification methods based on a training set, the convenience of the Internet means that comments are updated quickly and feature words change considerably. For the classifier to keep identifying spam comments accurately, the training samples must change along with the comments. Whenever the training samples change, the feature items must be reselected and their weights recalculated and extracted, which seriously reduces the operating efficiency of the system and causes inconvenience.
Summary
The embodiments of the present application provide a method, apparatus, device, and medium for identifying spam comments, so as to automatically identify spam comments in Internet comment information.
An embodiment of the present application provides a method for identifying spam comments, including:
obtaining a plurality of comments to be identified and a subject term set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set;
identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result;
if it is determined that comments that have not been successfully identified remain among the plurality of comments to be identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and updating the plurality of comments to be identified to be the comments that have not been successfully identified;
returning to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
An embodiment of the present application also provides an apparatus for identifying spam comments, the apparatus including:
a similarity calculation module, configured to obtain a plurality of comments to be identified and a subject term set corresponding to a target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set;
a comment identification module, configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result;
a subject term set update module, configured to, if it is determined that comments that have not been successfully identified remain among the plurality of comments to be identified, perform, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and to update the plurality of comments to be identified to be the comments that have not been successfully identified;
a comment successful-identification module, configured to return to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
An embodiment of the present application also provides a computer device, the computer device including:
one or more processors;
a storage apparatus for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method for identifying spam comments described in any embodiment of the present application.
An embodiment of the present application also provides a computer-readable storage medium storing a computer program, wherein the program, when executed by a processor, implements the method for identifying spam comments described in any embodiment of the present application.
Brief Description of the Drawings
FIG. 1 is a flowchart of a method for identifying spam comments in Embodiment 1 of the present application;
FIG. 2 is a flowchart of a method for identifying spam comments in Embodiment 2 of the present application;
FIG. 3a is a flowchart of a method for identifying spam comments in Embodiment 3 of the present application;
FIG. 3b is an overall block diagram of a method for identifying spam comments in Embodiment 3 of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for identifying spam comments in Embodiment 4 of the present application;
FIG. 5 is a schematic structural diagram of a computer device in Embodiment 5 of the present application.
Detailed Description
The present application is described below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the present application, not to limit it. It should also be noted that, for ease of description, the drawings show only the structures related to the present application rather than all structures.
Embodiment 1
FIG. 1 is a flowchart of the method for identifying spam comments provided in Embodiment 1 of the present application. This embodiment is applicable to identifying spam comments in Internet comment information. The method can be executed by an apparatus for identifying spam comments, which can be implemented in hardware and/or software and can generally be integrated into a computer device with a spam comment identification function, such as a terminal device or a server. The method includes the following steps:
S110. Obtain a plurality of comments to be identified and a subject term set corresponding to the target article, and calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set.
A comment to be identified is a piece of Internet comment information that corresponds to the target article and is to undergo identification. The subject term set is the set of subject terms corresponding to the target article.
For example, the calculation weight of each subject term can be calculated according to the formula 1+log10(1+n), where n is the number of times the subject term appears in the target article.
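The example weight formula above can be illustrated with a short Python sketch. The function names `subject_term_weight` and `initial_weights`, and the assumption that the target article has already been segmented into a list of tokens, are hypothetical and are introduced here only for illustration.

```python
import math


def subject_term_weight(n: int) -> float:
    """Return 1 + log10(1 + n), where n is the number of times
    the subject term appears in the target article."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 + math.log10(1 + n)


def initial_weights(article_tokens, subject_terms):
    """Count each subject term in the (already tokenized) article
    and map it to its calculation weight."""
    counts = {term: 0 for term in subject_terms}
    for token in article_tokens:
        if token in counts:
            counts[token] += 1
    return {term: subject_term_weight(c) for term, c in counts.items()}
```

Under this formula a subject term that never appears in the article still receives a weight of 1, and the weight grows only logarithmically with the number of occurrences, so very frequent terms do not dominate the similarity.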
S120. According to the similarity calculation result, identify candidate spam comments and normal comments among the plurality of comments to be identified.
A spam comment is a comment with low similarity to the target article, that is, a comment that is not strongly related to the target article. A candidate spam comment is a comment whose similarity to the target article is preliminarily judged to be low and whose category can be finalized only after a further confirmation step. A normal comment is a comment with high similarity to the target article, that is, a comment that is strongly related to the target article.
According to the similarity calculated between each comment to be identified and the target article, the plurality of comments to be identified can be classified so as to distinguish candidate spam comments from normal comments.
In an optional implementation of this embodiment, a comment to be identified whose similarity calculation result is greater than or equal to a preset threshold (for example, 90%) can be directly determined to be a normal comment; a comment to be identified whose similarity calculation result is less than or equal to a preset threshold (for example, 5%) can be directly determined to be a candidate spam comment; and a comment to be identified whose similarity calculation result falls within a preset threshold range (for example, 5%-90%) is determined to be a comment that has not been successfully identified.
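The three-way decision described above can be sketched as follows. This is a minimal illustration that assumes similarity is expressed as a fraction between 0 and 1; the function name and the returned labels are hypothetical, and the default thresholds are simply the example values of 5% and 90%.

```python
def classify_by_similarity(similarity: float,
                           first_threshold: float = 0.05,
                           second_threshold: float = 0.90) -> str:
    """Map a similarity score to one of three outcomes.

    similarity <= first_threshold   -> candidate spam comment
    similarity >= second_threshold  -> normal comment
    otherwise                       -> not yet successfully identified
    """
    if first_threshold >= second_threshold:
        raise ValueError("first_threshold must be below second_threshold")
    if similarity <= first_threshold:
        return "candidate_spam"
    if similarity >= second_threshold:
        return "normal"
    return "unidentified"
```

Comments that fall in the middle band are not forced into either class at this point; they are deferred to a later round.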
In this embodiment, the candidate spam comments may be directly determined to be spam comments, or they may undergo a secondary screening; this embodiment places no limitation on this.
S130. If it is determined that comments that have not been successfully identified remain among the plurality of comments to be identified, perform, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and update the plurality of comments to be identified to be the comments that have not been successfully identified.
A comment that has not been successfully identified belongs to a third category: it is neither a candidate spam comment nor a normal comment. Expanding the subject term set means adding newly selected subject terms to it. The rules for selecting new subject terms can be set according to actual needs; for example, synonyms of the subject terms in the target article can be used as new subject terms. This embodiment places no limitation on this.
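One expansion strategy described elsewhere in this application is to add high-frequency terms from the identified normal comments to the subject term set. The sketch below illustrates that idea only. The function name, the `min_count` cutoff for what counts as "high-frequency", and the reuse of 1+log10(1+count) as the weight of a newly added term are all assumptions made for illustration; the source states only that weights are set for newly added terms.

```python
import math
from collections import Counter


def expand_subject_terms(subject_weights, normal_comments, min_count=2):
    """Return a new term->weight mapping that also contains terms
    appearing at least `min_count` times in the normal comments.

    subject_weights: dict mapping existing subject terms to weights.
    normal_comments: iterable of token lists (already segmented).
    """
    counts = Counter(tok for comment in normal_comments for tok in comment)
    expanded = dict(subject_weights)  # existing weights are left unchanged
    for term, count in counts.items():
        if count >= min_count and term not in expanded:
            # Assumed weighting for new terms, for illustration only.
            expanded[term] = 1.0 + math.log10(1 + count)
    return expanded
```

Because the function returns a new mapping rather than mutating its input, each round of identification can keep the previous subject term set for comparison if desired.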
S140. Return to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
Note that the comments to be identified in this step are the comments that were not successfully identified in the previous round. Using the calculation weights of the subject terms in the expanded subject term set, including all newly added subject terms, the similarity between each such comment and the target article can be recalculated so as to identify the comments that remain. If one expansion of the subject term set is not enough to identify all of the remaining comments, the subject term set is expanded again, and this repeats until all comments to be identified have been successfully identified.
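The loop formed by steps S110 to S140 can be sketched as below. The similarity function and the subject-term update function are passed in as parameters because their exact form is not fixed here. All names are hypothetical, and the `max_rounds` cap is an added safeguard that is not part of the described method, which simply repeats until every comment is identified.

```python
def identify_comments(comments, subject_weights, similarity, update_subject_terms,
                      first_threshold=0.05, second_threshold=0.90, max_rounds=10):
    """Iteratively split comments into candidate spam and normal.

    similarity(comment, subject_weights) -> float in [0, 1]
    update_subject_terms(subject_weights, normal_comments) -> new weights
    Returns (candidate_spam, normal, still_unidentified).
    """
    pending = list(comments)
    candidate_spam, normal = [], []
    for _ in range(max_rounds):
        if not pending:
            break
        still_pending = []
        for comment in pending:
            score = similarity(comment, subject_weights)
            if score <= first_threshold:
                candidate_spam.append(comment)
            elif score >= second_threshold:
                normal.append(comment)
            else:
                still_pending.append(comment)
        pending = still_pending
        if pending:
            # Expand or re-weight the subject terms using the normal comments so far.
            subject_weights = update_subject_terms(subject_weights, normal)
    return candidate_spam, normal, pending
```

In this sketch, any comments still unresolved when the round cap is reached are returned separately so that the caller can decide how to handle them.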
The technical solution of this embodiment calculates the similarity between each comment to be identified and the target article using the calculation weight of each of the plurality of subject terms in the subject term set, and identifies candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result. If it is determined that comments that have not been successfully identified remain, the subject term set is updated and those comments are identified again, until the category of every comment to be identified has been determined. This enables multiple rounds of automatic identification of spam comments in Internet comment information and improves the effectiveness of spam comment identification.
Embodiment 2
FIG. 2 is a flowchart of the method for identifying spam comments provided in Embodiment 2 of the present application. This embodiment refines the above embodiment. Here, obtaining the plurality of comments to be identified corresponding to the target article includes: obtaining all comments corresponding to the target article, and matching each comment against a lexicon of common Internet expressions; and obtaining candidate spam comments, candidate normal comments, and unrecognizable comments according to the matching result, and determining the candidate normal comments and the unrecognizable comments to be the plurality of comments to be identified.
Optionally, a plurality of candidate spam comments are identified; after all comments to be identified have been successfully identified, the method further includes: filtering the identified candidate spam comments and, according to the filtering result, identifying each candidate spam comment as either a spam comment or a normal comment.
As shown in FIG. 2, the method includes the following steps:
S210. Obtain all comments corresponding to the target article, and match each comment against a lexicon of common Internet expressions.
Common Internet expressions are the many conventional characters, words, or phrases that appear online, for example 顶 (bump), 加油 (keep it up), 支持 (support), 沙发 (literally "sofa", claiming the first reply), 无聊 (boring), 打酱油 (just passing by), 占座 (saving a seat), and 灌水 (flooding a thread with filler). The lexicon of common Internet expressions is a word list containing such expressions.
S220. Obtain candidate spam comments, candidate normal comments, and unrecognizable comments according to the matching result, and determine the candidate normal comments and the unrecognizable comments to be the plurality of comments to be identified.
In this step, candidate spam comments are short spam comments; candidate normal comments are short normal comments; and unrecognizable comments are all non-short comments.
In an optional implementation, after word segmentation, part-of-speech retention, deduplication, and stop-word removal have been performed on all comments corresponding to the target article, the length L of each comment is calculated and a threshold T is set to assess comment length, for example, 5≤T≤8. When L<T, the comment is a short comment, and the set of short comments is defined as ShorD; when L is greater than or equal to T, the comment is a non-short comment, and the set of non-short comments is LongD. Each comment in the set ShorD is looked up and matched against the words in the lexicon of common Internet expressions; the number of matched normal Internet words is recorded as num1, and the number of matched spam Internet words is recorded as num2. If num1>=num2, the comment is marked as a candidate normal comment; if num1<num2, the comment is marked as a candidate spam comment.
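The short-comment pre-filter described above can be sketched as follows. The sketch assumes each comment has already been segmented, deduplicated, and stripped of stop words, so that its length L is simply the number of remaining tokens. It also assumes the lexicon of common Internet expressions is supplied as two sets, one of normal words and one of spam words. The function name, parameter names, and labels are hypothetical.

```python
def prefilter_comment(tokens, normal_lexicon, spam_lexicon, length_threshold=5):
    """Classify one preprocessed comment.

    Non-short comments (L >= T) are left for the similarity-based rounds.
    Short comments (L < T) are labelled by comparing lexicon matches:
    num1 >= num2 -> candidate normal, num1 < num2 -> candidate spam.
    """
    if len(tokens) >= length_threshold:
        return "unrecognizable"  # member of LongD
    num1 = sum(1 for t in tokens if t in normal_lexicon)  # normal matches
    num2 = sum(1 for t in tokens if t in spam_lexicon)    # spam matches
    return "candidate_normal" if num1 >= num2 else "candidate_spam"
```

Note that, under the stated rule, a short comment with no lexicon matches at all has num1 = num2 = 0 and is therefore marked as a candidate normal comment, so it still proceeds to the similarity-based identification.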
由于包含网络常用语的短小评论中基本上是不会包含与目标文章内容有关的词汇,所以对这种包含网络常用语的短小评论来说,利用文本相似度的方法来识别其类别,其效果肯定是不好的。因此,在本实施例中,针对评论的长度问题,首先利用网络常用语词库对短小评论进行识别,再利用文本相似度将无法识别评论进行识别,使得无论是短小评论还是非短小评论都能识别出垃圾评论。Since the short comments containing commonly used words on the Internet basically do not contain words related to the content of the target article, for this kind of short comments containing common words on the Internet, using the method of text similarity to identify their categories, the effect It must be bad. Therefore, in this embodiment, aiming at the length of the comments, firstly, the short comments are identified by using the network common phrases lexicon, and then the unidentifiable comments are identified by text similarity, so that no matter whether it is a short comment or a non-short comment, it can be identified. Comment spam identified.
S230、获取与目标文章对应的多个待识别评论和主题词集,并根据主题词 集中的多个主题词中每个主题词的计算权重,计算每个待识别评论与目标文章之间的相似度。S230. Acquire multiple to-be-recognized comments and keyword sets corresponding to the target article, and calculate the similarity between each to-be-identified comment and the target article according to the calculation weight of each of the multiple keyword words in the keyword set Spend.
S240、根据相似度计算结果,在所述多个待识别评论中识别出备选垃圾评论和正常评论。S240. According to the similarity calculation result, identify candidate spam comments and normal comments from the plurality of comments to be identified.
S250、如果确定所述多个待识别评论中存在未成功识别的待识别评论,则根据识别出的正常评论,对所述主题词集进行以下操作中的至少之一,得到新的主题词集作为所述主题词集:对所述主题词集进行主题词扩充,和对所述主题词集中的所述多个主题词的计算权重进行更新,并将所述多个待识别评论更新为所述未成功识别的待识别评论。S250. If it is determined that there are unrecognized comments among the plurality of unrecognized comments, according to the identified normal comments, perform at least one of the following operations on the subject term set to obtain a new subject term set As the subject term set: subject term expansion is performed on the subject term set, and the calculation weights of the plurality of subject terms in the subject term set are updated, and the plurality of comments to be identified are updated to all Unrecognized comments that were not successfully identified.
S260、返回执行根据主题词集中的每个主题词的计算权重,计算所述多个待识别评论中的每个待识别评论与目标文章之间的相似度,直至全部待识别评论被成功识别。S260. Go back and execute calculating the similarity between each of the plurality of unrecognized comments and the target article according to the calculation weight of each keyword in the keyword set, until all the unrecognized comments are successfully identified.
本实施例未尽详细解释之处请参见前述实施例,在此不再赘述。For details not explained in this embodiment, please refer to the foregoing embodiments, and details are not repeated here.
S270、对识别出的所述多个备选垃圾评论进行过滤处理,并根据过滤结果,将识别出的每个备选垃圾评论识别为垃圾评论或者正常评论。S270. Filter the multiple identified candidate spam comments, and identify each identified candidate spam comment as a spam comment or a normal comment according to the filtering result.
本申请实施例的技术方案,通过将与目标文章对应的全部评论与网络常用语词库进行匹配,得到备选垃圾评论、备选正常评论以及无法识别评论,并将备选正常评论和无法识别评论确定为待识别评论,利用文本相似度的计算方法对待识别评论进行识别,并在对全部待识别评论的成功识别之后,对识别出的多个备选垃圾评论进行二次过滤处理,使得无论是短小评论还是非短小评论都能识别出垃圾评论,实现对互联网评论信息中的垃圾评论进行自动识别,提高了垃圾评论的识别效果。In the technical solution of the embodiment of the present application, by matching all the comments corresponding to the target article with the dictionary of commonly used words in the network, the candidate spam comments, the normal comments and unrecognizable comments are obtained, and the normal comments and unrecognizable comments are combined The comment is determined as a comment to be identified, and the text similarity calculation method is used to identify the comment to be identified, and after the successful identification of all the comments to be identified, a secondary filtering process is performed on the identified multiple candidate spam comments, so that no matter Both short and non-short comments can identify spam comments, realize automatic identification of spam comments in Internet comment information, and improve the recognition effect of spam comments.
Secondary filtering means filtering the candidate spam comments again, using common Internet expressions and subject terms, by comparing the proportions of normal words and spam words among the total words of a candidate spam comment. When the proportion of normal words among the total words of the candidate spam comment is greater than or equal to a threshold, the comment is determined to be a normal comment; when the proportion is less than the threshold, the comment is determined to be a spam comment. This reduces the likelihood that a normal comment is identified as a spam comment. The threshold can be set according to actual requirements and is not limited in this embodiment.
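The ratio test used in secondary filtering can be illustrated with a minimal Python sketch. The function name `secondary_filter`, the word-list input, the `normal_vocab` set, and the default threshold of 0.5 are assumptions made for illustration; the embodiment does not fix a threshold or a tokenization scheme.

```python
def secondary_filter(comment_words, normal_vocab, threshold=0.5):
    """Re-classify one candidate spam comment by its share of normal words.

    comment_words: list of words in the candidate spam comment (already tokenized).
    normal_vocab:  hypothetical set of normal words (e.g. subject terms and
                   acceptable common Internet expressions).
    threshold:     hypothetical value; the embodiment leaves it configurable.
    """
    if not comment_words:
        # Assumption: an empty comment contains no normal words, so it stays spam.
        return "spam"
    normal_count = sum(1 for word in comment_words if word in normal_vocab)
    ratio = normal_count / len(comment_words)
    # Ratio >= threshold: normal comment; otherwise: spam comment.
    return "normal" if ratio >= threshold else "spam"
```

A comment whose normal-word share equals the threshold is treated as normal, matching the "greater than or equal to" wording above.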
Embodiment Three
FIG. 3a is a flowchart of the spam comment identification method provided in Embodiment Three of the present application, and FIG. 3b is an overall block diagram of that method. This embodiment refines the foregoing embodiments. Specifically, identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result includes: obtaining the similarity calculation result corresponding to each comment to be identified; if the similarity calculation result corresponding to a comment to be identified is less than or equal to a first threshold, determining that the comment is a candidate spam comment; if the similarity calculation result is greater than or equal to a second threshold, determining that the comment is a normal comment; and if the similarity calculation result is greater than the first threshold and less than the second threshold, determining that the comment has not been successfully identified.
As shown in FIG. 3a, the method includes the following steps:
S310. Obtain all comments corresponding to the target article, and match each comment against the lexicon of common Internet expressions.
S320. Obtain candidate spam comments, candidate normal comments, and unrecognizable comments according to the matching result, and determine the candidate normal comments and the unrecognizable comments to be the plurality of comments to be identified.
S330. Obtain the plurality of comments to be identified and the subject term set corresponding to the target article, and calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set.
Optionally, the calculation weight of each subject term includes the weight of that subject term in the target article, and calculating the similarity between each comment to be identified and the target article P according to the calculation weight of each of the plurality of subject terms in the subject term set includes:
using the formula
Figure PCTCN2022108563-appb-000001
to calculate the similarity between each comment to be identified and the target article.
Here, C_k denotes the vector of the k-th comment to be identified, P denotes the vector of the target article, n is the dimension of the vectors, w_i denotes the weight of subject term i in the target article, w_ik denotes the weight of subject term i in the k-th comment to be identified, and S_i denotes the semantic information between words. In the first round of similarity calculation between the comments to be identified and the target article, S_i is 1; in the remaining rounds,
Figure PCTCN2022108563-appb-000002
where Sim(P_i, C_i'k) denotes the similarity score between subject term i in the k-th comment to be identified and the synonym i' of subject term i in the target article; if the two are the same word, the value is 1.
Figure PCTCN2022108563-appb-000003
denotes the word-form similarity, where LenP is the number of subject terms in the target article and Same(P, C_k) is the number of subject terms of the target article, or synonyms of those subject terms, that appear in the k-th comment to be identified. Because the value of
Figure PCTCN2022108563-appb-000004
is not greater than 1, multiplying by it reduces the value of the whole expression and lowers the similarity score, so a smoothing factor of 0.5 is added to the expression.
In this embodiment, n is the dimension of the vector C_k of the k-th comment to be identified and of the vector P of the target article; numerically, n equals the number of subject terms.
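Because the formula itself survives only as a figure reference, the following Python sketch illustrates just the parts that the definitions above make explicit: a cosine-style similarity over the subject-term dimensions with a per-term semantic factor S_i (1 in the first round), and the word-form ratio Same(P, C_k) / LenP. The function names, the dictionary inputs, and the standard cosine normalization are assumptions; the sketch does not reproduce how the 0.5 smoothing factor enters the formula, and it counts only exact term matches, not synonyms.

```python
import math

def weighted_cosine(article_weights, comment_weights, semantic=None):
    """Cosine-style similarity over the subject terms of the target article.

    article_weights: dict mapping subject term i to w_i.
    comment_weights: dict mapping subject term i to w_ik (missing terms count as 0).
    semantic:        optional dict mapping subject term i to S_i; defaults to 1,
                     as in the first round of calculation.
    """
    semantic = semantic or {}
    terms = list(article_weights)
    numerator = sum(
        article_weights[t] * comment_weights.get(t, 0.0) * semantic.get(t, 1.0)
        for t in terms
    )
    norm_p = math.sqrt(sum(article_weights[t] ** 2 for t in terms))
    norm_c = math.sqrt(sum(comment_weights.get(t, 0.0) ** 2 for t in terms))
    if norm_p == 0 or norm_c == 0:
        return 0.0  # no shared weight mass: treat as dissimilar
    return numerator / (norm_p * norm_c)

def word_form_similarity(article_terms, comment_terms):
    """Same(P, C_k) / LenP, counting exact subject-term matches only."""
    if not article_terms:
        return 0.0
    same = sum(1 for t in article_terms if t in comment_terms)
    return same / len(article_terms)
```

With n subject terms, both vectors have n components, which matches the statement that n equals the number of subject terms.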
In addition, to compensate in the remaining rounds for the inability of traditional similarity methods to recognize synonyms, and to raise the similarity score between a comment to be identified and the target article, this embodiment proposes calculating that similarity with a cosine similarity formula improved by inter-word similarity, word position information, and word-form similarity. The improved formula is as follows:
Figure PCTCN2022108563-appb-000005
Here, Similarity'(P, C_k) is the improved similarity between the k-th comment to be identified C_k and the target article P, w'_i = w_i * L(t), w'_ik = w_ik * L(t), and L(t) denotes the position of subject term i in the target article.
S340. Obtain the similarity calculation result corresponding to each comment to be identified.
S350. If the similarity calculation result corresponding to a comment to be identified is less than or equal to the first threshold, determine that the comment is a candidate spam comment.
The first threshold is a preset value used to evaluate a comment to be identified as a candidate spam comment. It can be set according to actual requirements and is not limited in this embodiment.
S360. If the similarity calculation result corresponding to a comment to be identified is greater than or equal to the second threshold, determine that the comment is a normal comment.
The second threshold is a preset value used to evaluate a comment to be identified as a normal comment. It can be set according to actual requirements and is not limited in this embodiment.
S370. If the similarity calculation result corresponding to a comment to be identified is greater than the first threshold and less than the second threshold, determine that the comment has not been successfully identified.
For example, let the first threshold be a value b and the second threshold be a value a. If the similarity calculation result corresponding to a comment to be identified is less than or equal to b, the comment is determined to be a candidate spam comment; if the result is greater than or equal to a, the comment is determined to be a normal comment; and if the result is greater than b and less than a, the comment is determined not to have been successfully identified.
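The two-threshold decision of steps S350 to S370 can be sketched as a small Python function. The function name and the string labels are assumptions for illustration; the requirement that the first threshold be lower than the second is implied by the "greater than b and less than a" case but is not stated explicitly in the source.

```python
def classify_by_similarity(score, first_threshold, second_threshold):
    """Map a similarity score to one of three outcomes.

    first_threshold  corresponds to b (at or below: candidate spam).
    second_threshold corresponds to a (at or above: normal).
    """
    if first_threshold >= second_threshold:
        # Assumption: the middle band is only meaningful when b < a.
        raise ValueError("first_threshold must be lower than second_threshold")
    if score <= first_threshold:
        return "candidate_spam"
    if score >= second_threshold:
        return "normal"
    return "unidentified"  # b < score < a: retry in a later round
```

Comments returned as `"unidentified"` are the ones carried into the next round after the subject term set is updated.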
S380. If it is determined that some of the plurality of comments to be identified have not been successfully identified, perform at least one of the following operations on the subject term set, according to the identified normal comments, to obtain a new subject term set that serves as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and update the plurality of comments to be identified to be the comments that were not successfully identified.
Optionally, expanding the subject term set according to the identified normal comments includes: obtaining the high-frequency words contained in the identified normal comments, adding the high-frequency words to the subject term set as new subject terms, and setting weights for the newly added subject terms; and counting the frequency with which each subject term in the new subject term set occurs in the target article, matching against a co-occurrence word lexicon according to those frequencies to obtain co-occurrence words associated with the occurrence frequency of at least one subject term in the new subject term set, adding the co-occurrence words to the subject term set as new subject terms, and setting weights for the newly added subject terms.
The weights of the high-frequency words can be adjusted according to the formula
Figure PCTCN2022108563-appb-000006
where Weight'(t_r) is the adjusted weight of a high-frequency word; t_r is a word that appears in the normal comments; T(t_r) is the weight of word t_r in the normal comments, calculated as 1 + log10(1 + n_k); T(k) is the number of normal comments that include word t_r after K rounds of similarity comparison; and N(k) is the total number of normal comments after K rounds of similarity comparison.
Calculating the weights of the words in the normal comments directly reflects the weight those words carry with respect to the target article, so that words with higher weights can be added to the subject term set.
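The high-frequency-word part of the expansion can be sketched in Python as follows. Only the term weight T(t_r) = 1 + log10(1 + n_k) is taken from the text; the adjusted weight Weight'(t_r) is given in the figure and is not reproduced here. The function name, the `top_n` cut-off for what counts as "high-frequency", and the tokenized input format are assumptions.

```python
import math
from collections import Counter

def high_frequency_terms(normal_comments, subject_terms, top_n=2):
    """Pick new subject-term candidates from the identified normal comments.

    normal_comments: list of tokenized normal comments (lists of words).
    subject_terms:   current subject term set; its members are skipped.
    top_n:           hypothetical cut-off for "high-frequency" words.
    Returns a dict mapping each new term to T(t_r) = 1 + log10(1 + n_k),
    where n_k is the term's occurrence count in the normal comments.
    """
    counts = Counter(word for comment in normal_comments for word in comment)
    new_terms = {}
    for word, n_k in counts.most_common():
        if len(new_terms) == top_n:
            break
        if word in subject_terms:
            continue  # already a subject term
        new_terms[word] = 1 + math.log10(1 + n_k)
    return new_terms
```

The co-occurrence lexicon lookup described above would run as a second pass over the enlarged set and is omitted from this sketch.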
Optionally, updating the calculation weights of the plurality of subject terms in the subject term set according to the identified normal comments includes:
using the formula
Figure PCTCN2022108563-appb-000007
to update the calculation weight of each subject term in the subject term set.
Here, Weight'(i) is the updated calculation weight of word i; word i is a subject term in the subject term set of the target article; the synonym i' is a synonym of word i that appears in the normal comments; n_p is the number of times word i appears in the target article; n_k is the number of times word i appears in the normal comments after K rounds of similarity comparison; T(k) is the number of normal comments that include word i after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison; N_i' is the set of synonyms of word i; Weight(i') is the weight of the synonym i' of word i across all normal comments; Sim(i, i') is the similarity score between word i and the synonym i'; and μ is an adjustment factor greater than 0 that controls how strongly the weight and similarity of each synonym in the synonym set of word i affect the weight of word i.
Here, 1 + log(1 + n_p + n_k) represents the term frequency of the word. The 1 inside the logarithm prevents the logarithm from evaluating to zero: n_p may well be 0, and if n_k is then 1, log(n_p + n_k) would be 0. In addition, because the value of the logarithm is generally less than 1, it would reduce the value of the whole expression and could adversely affect the subsequent comment classification, so this embodiment also adds 1 in front of the logarithm. In the formula,
Figure PCTCN2022108563-appb-000008
represents the proportion of normal comments in which word i appears among all normal comments. A higher frequency of occurrence is not by itself better; it also matters whether the word is distributed evenly across the articles in which it appears. Here, a larger T(k) is better: it indicates that the word is more evenly distributed within the class and that commenters in general are discussing the issue. Because
Figure PCTCN2022108563-appb-000009
is a number less than 1, multiplying by it reduces the weight of high-frequency subject terms, which reduces their negative impact on classification. To some extent it also reduces the influence that high-frequency subject terms in fake comments have on comment classification, lowering the similarity score of fake comments.
If synonyms of a subject term keep appearing in other comments, the content under discussion is related to that subject term and therefore to the article. The calculation weight of such a subject term should be increased by adding the weight contribution of its synonyms, namely
Figure PCTCN2022108563-appb-000010
In an optional implementation, the calculation weight of the synonym i' of word i can be adjusted according to the following formula, so as to adjust the calculation weight of subject term i.
Figure PCTCN2022108563-appb-000011
Here, i' is a synonym, appearing in the comments, of word i in the subject term set; T(i') is the weight of word i' in the normal comments, calculated as 1 + log10(1 + n_k); T(k) is the number of normal comments containing word i' after K rounds of similarity comparison; N(k) is the total number of normal comments in which word i' appears after K rounds of similarity comparison; T(i_p) is the weight of the synonym i_p of word i', where word i_p is a word in the subject term set before the calculation weight is adjusted; and N_p is the set of synonyms of word i', obtained from the subject term set of the target article.
By continuously adjusting the calculation weights of the subject terms, the above technical solution makes the identification results for the comments to be identified more accurate and reliable.
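The update formula is available only as a figure reference, so the following Python sketch is a reading of the surrounding explanation rather than the formula itself. It assumes that the term-frequency factor 1 + log(1 + n_p + n_k) is multiplied by the coverage ratio T(k) / N(k), and that the synonym contribution μ · Σ Weight(i') · Sim(i, i') is added to the product. The base-10 logarithm, the function name, and the default μ of 0.5 are likewise assumptions.

```python
import math

def updated_weight(n_p, n_k, t_k, n_total, synonyms, mu=0.5):
    """Assumed form of Weight'(i); see the hedges in the lead-in.

    n_p:      occurrences of term i in the target article.
    n_k:      occurrences of term i in the normal comments so far.
    t_k:      number of normal comments containing term i, T(k).
    n_total:  total number of normal comments, N(k).
    synonyms: list of (Weight(i'), Sim(i, i')) pairs for synonyms of term i.
    mu:       adjustment factor, required to be greater than 0.
    """
    if mu <= 0:
        raise ValueError("mu must be greater than 0")
    if n_total == 0:
        return 0.0  # assumption: no normal comments yet, nothing to update from
    term_frequency = 1 + math.log10(1 + n_p + n_k)  # assumed base-10 log
    coverage = t_k / n_total                        # T(k) / N(k), at most 1
    synonym_part = mu * sum(weight * sim for weight, sim in synonyms)
    return term_frequency * coverage + synonym_part
```

Under these assumptions, a term that is absent from the article (n_p = 0) but appears 9 times across 5 of 10 normal comments, with one synonym of weight 1.0 and similarity 0.8, receives 2 × 0.5 + 0.5 × 0.8 = 1.4.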
S390. Return to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
S3100. Filter the plurality of identified candidate spam comments and, according to the filtering result, identify each identified candidate spam comment as either a spam comment or a normal comment.
In the technical solution of this embodiment of the present application, all comments corresponding to the target article are matched against the lexicon of common Internet expressions to obtain candidate spam comments, candidate normal comments, and unrecognizable comments. The candidate normal comments and the unrecognizable comments are taken as the comments to be identified and, with a first threshold and a second threshold set, are identified using a text similarity calculation. After all comments to be identified have been successfully identified, the candidate spam comments undergo secondary filtering. As a result, spam comments can be identified among both short and non-short comments, spam comments in Internet comment information are identified automatically, and the identification effect for spam comments is improved.
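The round-based control flow of steps S330 to S390 can be sketched in Python as follows. The similarity calculation and the subject-term update are passed in as callables (`score_fn` and `update_fn`, both hypothetical names) so that the sketch stays independent of the exact formulas. The `max_rounds` guard is an added assumption: the source simply loops until every comment is identified.

```python
def iterative_identification(comments, score_fn, update_fn,
                             first_threshold, second_threshold, max_rounds=10):
    """Classify comments in rounds, updating the subject terms between rounds.

    score_fn(comment) -> similarity between the comment and the target article.
    update_fn(normal) -> expands/reweights the subject term set from the
                         normal comments identified so far.
    Returns (candidate_spam, normal, still_unidentified).
    """
    pending = list(comments)
    candidate_spam, normal = [], []
    for _ in range(max_rounds):
        if not pending:
            break
        still_pending = []
        for comment in pending:
            score = score_fn(comment)
            if score <= first_threshold:
                candidate_spam.append(comment)
            elif score >= second_threshold:
                normal.append(comment)
            else:
                still_pending.append(comment)  # retry after the update
        pending = still_pending
        if pending:
            update_fn(normal)  # S380: refresh subject terms and weights
    return candidate_spam, normal, pending
```

The candidate spam list returned here is the input to the secondary filtering of step S3100.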
Embodiment Four
FIG. 4 is a schematic structural diagram of a spam comment identification apparatus provided in Embodiment Four of the present application. The apparatus can perform the spam comment identification method of the foregoing embodiments and may be implemented in software and/or hardware. As shown in FIG. 4, the spam comment identification apparatus includes: a similarity calculation module 410, a comment identification module 420, a subject term set update module 430, and a comment successful-identification module 440.
The similarity calculation module 410 is configured to obtain the plurality of comments to be identified and the subject term set corresponding to the target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set. The comment identification module 420 is configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result. The subject term set update module 430 is configured to, if it is determined that some of the plurality of comments to be identified have not been successfully identified, perform at least one of the following operations on the subject term set, according to the identified normal comments, to obtain a new subject term set that serves as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and to update the plurality of comments to be identified to be the comments that were not successfully identified. The comment successful-identification module 440 is configured to return to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
The technical solution of this embodiment of the present application calculates the similarity between each comment to be identified and the target article using the calculation weight of each subject term in the subject term set, and identifies candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result. If some comments to be identified have not been successfully identified, they are identified again after the subject term set is updated, until the category of every comment to be identified has been determined. This enables multiple rounds of automatic identification of spam comments in Internet comment information and improves the identification effect for spam comments.
Optionally, the similarity calculation module 410 is configured to obtain all comments corresponding to the target article and match each comment against the lexicon of common Internet expressions; and to obtain candidate spam comments, candidate normal comments, and unrecognizable comments according to the matching result, and determine the candidate normal comments and the unrecognizable comments to be the plurality of comments to be identified.
Optionally, there are a plurality of identified candidate spam comments, and the spam comment identification apparatus further includes a secondary filtering module configured to, after all comments to be identified have been successfully identified, filter the plurality of identified candidate spam comments and, according to the filtering result, identify each identified candidate spam comment as either a spam comment or a normal comment.
Optionally, the comment identification module 420 is configured to obtain the similarity calculation result corresponding to each comment to be identified; if the similarity calculation result corresponding to a comment to be identified is less than or equal to the first threshold, determine that the comment is a candidate spam comment; if the similarity calculation result is greater than or equal to the second threshold, determine that the comment is a normal comment; and if the similarity calculation result is greater than the first threshold and less than the second threshold, determine that the comment has not been successfully identified.
Optionally, the calculation weight of each subject term includes the weight of that subject term in the target article, and the similarity calculation module 410 is configured to use the formula
Figure PCTCN2022108563-appb-000012
to calculate the similarity between each comment to be identified and the target article. Here, C_k denotes the vector of the k-th comment to be identified, P denotes the vector of the target article, n is the dimension of the vectors, w_i denotes the weight of subject term i in the target article, w_ik denotes the weight of subject term i in the k-th comment to be identified, and S_i denotes the semantic information between words. In the first round of similarity calculation between the comments to be identified and the target article, S_i is 1; in the remaining rounds,
Figure PCTCN2022108563-appb-000013
where Sim(P_i, C_i'k) denotes the similarity score between subject term i in the k-th comment to be identified and the synonym i' of subject term i in the target article.
Figure PCTCN2022108563-appb-000014
denotes the word-form similarity, where LenP is the number of subject terms in the target article and Same(P, C_k) is the number of subject terms of the target article, or synonyms of those subject terms, that appear in the k-th comment to be identified.
Optionally, the subject term set update module 430 is configured to obtain the high-frequency words contained in the identified normal comments, add the high-frequency words to the subject term set as new subject terms, and set weights for the newly added subject terms; and to count the frequency with which each subject term in the new subject term set occurs in the target article, match against the co-occurrence word lexicon according to those frequencies to obtain co-occurrence words associated with the occurrence frequency of at least one subject term in the new subject term set, add the co-occurrence words to the subject term set as new subject terms, and set weights for the newly added subject terms.
Optionally, the subject term set update module 430 is configured to use the formula
Figure PCTCN2022108563-appb-000015
to update the calculation weight of each subject term in the subject term set. Here, Weight'(i) is the updated calculation weight of word i; word i is a subject term in the subject term set of the target article; the synonym i' is a synonym of word i that appears in the normal comments; n_p is the number of times word i appears in the target article; n_k is the number of times word i appears in the normal comments after K rounds of similarity comparison; T(k) is the number of normal comments that include word i after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison; N_i' is the set of synonyms of word i; Weight(i') is the weight of the synonym i' of word i across all normal comments; Sim(i, i') is the similarity score between word i and the synonym i'; and μ is an adjustment factor greater than 0.
The spam comment identification apparatus provided in this embodiment of the present application can perform the spam comment identification method provided in any embodiment of the present application, and has the functional modules corresponding to the method.
Embodiment Five
FIG. 5 is a schematic structural diagram of a computer device provided in Embodiment Five of the present application. As shown in FIG. 5, the computer device includes a processor 510, a memory 520, an input apparatus 530, and an output apparatus 540. The computer device may have one or more processors 510; FIG. 5 takes one processor 510 as an example. The processor 510, the memory 520, the input apparatus 530, and the output apparatus 540 of the computer device may be connected by a bus or in other ways; FIG. 5 takes connection by a bus as an example.
As a computer-readable storage medium, the memory 520 can store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the spam comment identification method in the embodiments of the present application (for example, the similarity calculation module 410, the comment identification module 420, the subject term set update module 430, and the comment successful-identification module 440 of the spam comment identification apparatus). By running the software programs, instructions, and modules stored in the memory 520, the processor 510 executes the various functional applications and data processing of the computer device, that is, implements the spam comment identification method described above.
The method includes: obtaining a plurality of comments to be identified and a subject term set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set; identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result; if it is determined that some of the plurality of comments to be identified have not been successfully identified, performing at least one of the following operations on the subject term set, according to the identified normal comments, to obtain a new subject term set that serves as the subject term set: expanding the subject term set with new subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and updating the plurality of comments to be identified to be the comments that were not successfully identified; and returning to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
存储器520可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序;存储数据区可存储根据终端的使用所创建的数据等。此外,存储器520可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他非易失性固态存储器件。在一些实例中,存储器520可进一步包括相对于处理器510远程设置的存储器,这些远程存储器可以通过网络连接至计算机设备。上述网络的实例包括但不限于互联网、企业内部网、局域网、移动通信网及其组合。The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices. In some examples, the memory 520 may further include memory located remotely from the processor 510, and these remote memories may be connected to the computer device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
输入装置530可用于接收输入的数字或字符信息,以及产生与计算机设备的用户设置以及功能控制有关的键信号输入。输出装置540可包括显示屏等显示设备。The input device 530 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the computer equipment. The output device 540 may include a display device such as a display screen.
Embodiment 6
Embodiment 6 of the present application further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to perform a spam comment identification method, the method including: obtaining a plurality of comments to be identified and a subject term set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set; identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation results; if it is determined that the plurality of comments to be identified include comments that were not successfully identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set that serves as the subject term set: expanding the subject term set with additional subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; updating the plurality of comments to be identified to be the comments that were not successfully identified; and returning to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
Of course, in the storage medium containing computer-executable instructions provided by the embodiments of the present application, the computer-executable instructions are not limited to the method operations described above and can also perform related operations in the spam comment identification method provided by any embodiment of the present application.
From the above description of the implementations, those skilled in the art can clearly understand that the present application can be implemented by software together with the necessary general-purpose hardware, and can of course also be implemented by hardware alone, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), a hard disk, or an optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
It should be noted that, in the above embodiment of the spam comment identification apparatus, the units and modules included are divided only according to functional logic; the division is not limited to the one described above, as long as the corresponding functions can be realized. In addition, the specific names of the functional units are only intended to distinguish them from one another and do not limit the protection scope of the present application.

Claims (10)

  1. A method for identifying spam comments, comprising:
    obtaining a plurality of comments to be identified and a subject term set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set;
    identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation results;
    in a case where it is determined that the plurality of comments to be identified include comments that were not successfully identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set that serves as the subject term set: expanding the subject term set with additional subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and updating the plurality of comments to be identified to be the comments that were not successfully identified;
    returning to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
  2. The method according to claim 1, wherein obtaining the plurality of comments to be identified corresponding to the target article comprises:
    obtaining all comments corresponding to the target article, and matching each comment against a lexicon of common internet expressions;
    obtaining candidate spam comments, candidate normal comments, and unrecognizable comments according to the matching results, and determining the candidate normal comments and the unrecognizable comments as the plurality of comments to be identified.
  3. The method according to claim 2, wherein a plurality of candidate spam comments are identified;
    after all comments to be identified have been successfully identified, the method further comprises:
    filtering the plurality of identified candidate spam comments, and identifying each identified candidate spam comment as a spam comment or a normal comment according to the filtering result.
  4. The method according to any one of claims 1-3, wherein identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation results comprises:
    obtaining the similarity calculation result corresponding to each comment to be identified;
    in a case where the similarity calculation result corresponding to a comment to be identified is less than or equal to a first threshold, determining that the comment to be identified is a candidate spam comment;
    in a case where the similarity calculation result corresponding to a comment to be identified is greater than or equal to a second threshold, determining that the comment to be identified is a normal comment;
    in a case where the similarity calculation result corresponding to a comment to be identified is greater than the first threshold and less than the second threshold, determining that the comment to be identified has not been successfully identified.
  5. The method according to any one of claims 1-3, wherein the calculation weight of each subject term comprises the weight of that subject term in the target article;
    calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set comprises:
    calculating the similarity between each comment to be identified and the target article by using the formula
    Figure PCTCN2022108563-appb-100001
    where C_k denotes the vector of the k-th comment to be identified, P denotes the vector of the target article, n is the dimension of the vectors, w_i denotes the weight of subject term i in the target article, w_ik denotes the weight of subject term i in the k-th comment to be identified, and S_i denotes the semantic information between words; in the first round of similarity calculation between the comments to be identified and the target article, S_i is 1, and in the remaining rounds
    Figure PCTCN2022108563-appb-100002
    Sim(P_i, C_i'k) denotes the similarity score between subject term i in the k-th comment to be identified and the synonym i' of subject term i in the target article,
    Figure PCTCN2022108563-appb-100003
    denotes the word-form similarity, LenP is the number of subject terms in the target article, and Same(P, C_k) is the number of subject terms of the target article, or synonyms of those subject terms, that appear in the k-th comment to be identified.
  6. The method according to any one of claims 1-3, wherein expanding the subject term set with additional subject terms according to the identified normal comments comprises:
    obtaining high-frequency words included in the identified normal comments, adding the high-frequency words to the subject term set as new subject terms, and setting weights for the newly added subject terms;
    counting the frequency of occurrence, in the target article, of each subject term included in the new subject term set; matching, according to the frequencies of occurrence, against a co-occurrence word lexicon to obtain co-occurring words associated with the frequency of occurrence of at least one subject term in the new subject term set; adding the co-occurring words to the subject term set as new subject terms; and setting weights for the newly added subject terms.
  7. The method according to any one of claims 1-3, wherein updating the calculation weights of the plurality of subject terms in the subject term set according to the identified normal comments comprises:
    updating the calculation weight of each subject term in the subject term set by using the formula
    Figure PCTCN2022108563-appb-100004
    where Weight'(i) is the updated calculation weight of word i, word i is a subject term in the subject term set of the target article, and synonym i' is a synonym of word i that appears in the normal comments; n_p is the number of times word i appears in the target article; n_k is the number of times word i appears in the normal comments after k rounds of similarity comparison; T(k) is the number of normal comments that include word i after k rounds of similarity comparison; N(k) is the total number of normal comments after k rounds of similarity comparison; N_i' is the set of synonyms of word i; Weight(i') is the weight of synonym i' of word i in all normal comments; Sim(i, i') is the similarity score between word i and synonym i'; and μ is an adjustment factor greater than 0.
  8. An apparatus for identifying spam comments, comprising:
    a similarity calculation module, configured to obtain a plurality of comments to be identified and a subject term set corresponding to a target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set;
    a comment identification module, configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation results;
    a subject term set update module, configured to, in a case where it is determined that the plurality of comments to be identified include comments that were not successfully identified, perform, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set that serves as the subject term set: expanding the subject term set with additional subject terms, and updating the calculation weights of the plurality of subject terms in the subject term set; and to update the plurality of comments to be identified to be the comments that were not successfully identified;
    a comment success identification module, configured to return to the step of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified have been successfully identified.
  9. A computer device, comprising:
    at least one processor;
    a storage apparatus, configured to store at least one program which, when executed by the at least one processor, causes the at least one processor to implement the method for identifying spam comments according to any one of claims 1-7.
  10. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for identifying spam comments according to any one of claims 1-7.
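
Two of the claimed steps are concrete enough to illustrate in code: the three-way threshold decision of claim 4 and the high-frequency-word part of the subject term expansion of claim 6. The sketch below is illustrative only and rests on several assumptions that are not in the source: the function names, whitespace tokenization (Chinese text would need a word segmenter), and the `top_k`, `min_count`, and `initial_weight` parameters are all hypothetical. The co-occurrence lexicon step of claim 6 and the formulas of claims 5 and 7 are not covered.

```python
from collections import Counter
from typing import Dict, List


def classify_by_similarity(
    score: float, first_threshold: float, second_threshold: float
) -> str:
    """Three-way decision described in claim 4."""
    if first_threshold >= second_threshold:
        raise ValueError("first_threshold must be below second_threshold")
    if score <= first_threshold:
        return "candidate_spam"
    if score >= second_threshold:
        return "normal"
    return "unidentified"  # strictly between the two thresholds


def expand_subject_terms(
    weights: Dict[str, float],
    normal_comments: List[str],
    top_k: int = 3,               # assumption: how many new terms to add
    min_count: int = 2,           # assumption: what counts as "high-frequency"
    initial_weight: float = 0.5,  # assumption: weight given to a new term
) -> Dict[str, float]:
    """Add frequent words from the normal comments as new subject terms."""
    # Assumption: whitespace tokenization, with no stop-word filtering.
    counts = Counter(
        token for comment in normal_comments for token in comment.lower().split()
    )
    expanded = dict(weights)
    added = 0
    for term, count in counts.most_common():
        if added >= top_k or count < min_count:
            break
        if term not in expanded:
            expanded[term] = initial_weight
            added += 1
    return expanded
```

For example, with illustrative thresholds of 0.2 and 0.6, a similarity of 0.4 falls between the two and the comment is carried over to the next round, after the subject term set has been refreshed from the comments already judged normal.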
PCT/CN2022/108563 2021-08-12 2022-07-28 Spam comment identification method and apparatus, and device and medium WO2023016267A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110925078.4 2021-08-12
CN202110925078.4A CN113656580A (en) 2021-08-12 2021-08-12 Method, device, equipment and medium for identifying spam comments

Publications (1)

Publication Number Publication Date
WO2023016267A1 true WO2023016267A1 (en) 2023-02-16

Family

ID=78491540

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/108563 WO2023016267A1 (en) 2021-08-12 2022-07-28 Spam comment identification method and apparatus, and device and medium

Country Status (2)

Country Link
CN (1) CN113656580A (en)
WO (1) WO2023016267A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113656580A (en) * 2021-08-12 2021-11-16 北京锐安科技有限公司 Method, device, equipment and medium for identifying spam comments

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8832116B1 (en) * 2012-01-11 2014-09-09 Google Inc. Using mobile application logs to measure and maintain accuracy of business information
CN109783616A (en) * 2018-12-03 2019-05-21 广东蔚海数问大数据科技有限公司 A kind of text subject extracting method, system and storage medium
CN110209795A (en) * 2018-06-11 2019-09-06 腾讯科技(深圳)有限公司 Comment on recognition methods, device, computer readable storage medium and computer equipment
CN112559685A (en) * 2020-12-11 2021-03-26 芜湖汽车前瞻技术研究院有限公司 Automobile forum spam comment identification method
CN113656580A (en) * 2021-08-12 2021-11-16 北京锐安科技有限公司 Method, device, equipment and medium for identifying spam comments

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102254038B (en) * 2011-08-11 2013-01-23 武汉安问科技发展有限责任公司 System and method for analyzing network comment relevance
CN103226576A (en) * 2013-04-01 2013-07-31 杭州电子科技大学 Comment spam filtering method based on semantic similarity
CN109902179A (en) * 2019-03-04 2019-06-18 上海宝尊电子商务有限公司 The method of screening electric business comment spam based on natural language processing
CN111125305A (en) * 2019-12-05 2020-05-08 东软集团股份有限公司 Hot topic determination method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN113656580A (en) 2021-11-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22855250

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE