WO2023016267A1 - Method and apparatus for identifying spam comments, and device and medium - Google Patents
- Publication number: WO2023016267A1 (application PCT/CN2022/108563)
- Authority: WO, WIPO (PCT)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/3332—Query translation
- G06F16/3334—Selection or weighting of terms from queries, including natural language queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3347—Query execution using vector based model
Definitions
- The embodiments of the present application relate to the technical field of big data mining, for example, to a method, apparatus, device and medium for identifying spam comments.
- Methods for preventing and identifying spam comments fall mainly into two categories: manual identification methods and automatic identification methods.
- Automatic identification methods can be further divided into classification methods based on a training set and identification methods based on similarity.
- Manual identification can only examine newly published comments and filter out the spam among them; it can do nothing about spam comments that have already been published. It also requires continuous manual maintenance, which is inconvenient. Moreover, spammers can use a variety of proxy methods to evade filtering mechanisms.
- As for classification methods based on a training set: because the network makes commenting convenient, comments are updated quickly and feature words change considerably. For the classifier to identify spam comments accurately, the training samples must change accordingly; whenever the training samples change, the feature items must be reselected and their weights recalculated, which seriously reduces the operating efficiency of the system and causes inconvenience.
- The embodiments of the present application provide a method, apparatus, device and medium for identifying spam comments, so as to realize automatic identification of spam comments in Internet comment information.
- An embodiment of the present application provides a method for identifying spam comments, including:
- performing at least one of the following operations on the subject term set to obtain a new subject term set as the subject term set: performing subject term expansion on the subject term set, and updating the calculation weights of the plurality of subject terms in the subject term set; and updating the plurality of comments to be identified to the comments to be identified that have not been successfully identified;
- An embodiment of the present application also provides an apparatus for identifying spam comments, the apparatus including:
- a similarity calculation module, configured to obtain a plurality of comments to be identified and a subject term set corresponding to a target article, and to calculate the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set;
- a comment identification module, configured to identify candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result;
- a subject term set updating module, configured to, if it is determined that there are comments to be identified that have not been successfully identified among the plurality of comments to be identified, perform at least one of the following operations on the subject term set according to the identified normal comments to obtain a new subject term set as the subject term set: performing subject term expansion on the subject term set, and updating the calculation weights of the plurality of subject terms in the subject term set; and to update the plurality of comments to be identified to the comments to be identified that have not been successfully identified;
- a comment success identification module, configured to return to calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all the comments to be identified are successfully identified.
- An embodiment of the present application also provides a computer device, and the computer device includes:
- one or more processors;
- a storage device for storing one or more programs, wherein when the one or more programs are executed by the one or more processors, the one or more processors implement the method for identifying spam comments described in any embodiment of the present application.
- An embodiment of the present application also provides a computer-readable storage medium on which a computer program is stored, wherein when the program is executed by a processor, the method for identifying spam comments described in any embodiment of the present application is implemented.
- Fig. 1 is a flowchart of a method for identifying spam comments in Embodiment 1 of the present application.
- Fig. 2 is a flowchart of a method for identifying spam comments in Embodiment 2 of the present application.
- Fig. 3a is a flowchart of a method for identifying spam comments in Embodiment 3 of the present application.
- Fig. 3b is an overall block diagram of a method for identifying spam comments in Embodiment 3 of the present application.
- Fig. 4 is a schematic structural diagram of an apparatus for identifying spam comments in Embodiment 4 of the present application.
- Fig. 5 is a schematic structural diagram of a computer device in Embodiment 5 of the present application.
- Fig. 1 is the flowchart of the method for identifying spam comments provided by Embodiment 1 of the present application.
- This embodiment is applicable to identifying spam comments in Internet comment information, and the method can be executed by an apparatus for identifying spam comments.
- The apparatus can be implemented in hardware and/or software, and can generally be integrated into a computer device capable of identifying spam comments, for example, a terminal device or a server.
- The method specifically includes the following steps:
- The comments to be identified are the items of Internet comment information that correspond to the target article and need to be identified.
- The subject term set is a set composed of a plurality of subject terms corresponding to the target article.
- The calculation weight of each subject term can be calculated according to the formula 1+log10(1+n), where n is the number of times the subject term appears in the target article.
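The weight formula above can be sketched in a few lines of Python. This is only an illustration: the function names `calculation_weight` and `subject_term_weights` are hypothetical, and the article is assumed to be already segmented into a list of tokens.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def calculation_weight(n: int) -> float:
    """Return 1 + log10(1 + n) for a subject term that occurs n times."""
    if n < 0:
        raise ValueError("occurrence count must be non-negative")
    return 1.0 + math.log10(1.0 + n)


def subject_term_weights(
    article_tokens: List[str], subject_terms: Iterable[str]
) -> Dict[str, float]:
    """Map each subject term to its calculation weight in the target article."""
    counts = Counter(article_tokens)  # terms absent from the article count as 0
    return {term: calculation_weight(counts[term]) for term in subject_terms}
```

With this form, a term that never appears in the article still receives a weight of 1, and the weight grows only logarithmically with the occurrence count.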
- Spam comments are comments with low similarity to the target article, that is, comments that are not strongly related to the target article. Candidate spam comments are comments initially judged to have low similarity to the target article, whose category needs to be confirmed in a subsequent step. Normal comments are comments with high similarity to the target article, that is, comments strongly related to the target article.
- The plurality of comments to be identified can be classified to distinguish candidate spam comments from normal comments.
- Comments to be identified whose similarity calculation results are greater than or equal to a preset threshold (for example, 90%) can be directly determined as normal comments; comments to be identified whose similarity calculation results are less than or equal to a preset threshold (for example, 5%) are directly determined as candidate spam comments; and comments to be identified whose similarity calculation results lie within a preset range (for example, 5%-90%) are determined as comments to be identified that have not been successfully identified.
- The candidate spam comments may be directly determined as spam comments, or a secondary screening may be performed on the candidate spam comments, which is not limited in this embodiment.
- Comments to be identified that have not been successfully identified form a third category: comments that are neither candidate spam comments nor normal comments.
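A minimal sketch of this three-way decision follows, assuming similarity scores in the range 0 to 1 and using the example thresholds of 5% and 90% mentioned above as defaults. The names `Label` and `classify` are hypothetical.

```python
from enum import Enum


class Label(Enum):
    CANDIDATE_SPAM = "candidate_spam"
    NORMAL = "normal"
    UNIDENTIFIED = "unidentified"  # not yet successfully identified


def classify(similarity: float, low: float = 0.05, high: float = 0.90) -> Label:
    """Classify one comment from its similarity to the target article."""
    if not low < high:
        raise ValueError("low threshold must be below high threshold")
    if similarity >= high:
        return Label.NORMAL
    if similarity <= low:
        return Label.CANDIDATE_SPAM
    return Label.UNIDENTIFIED  # strictly between the two thresholds
```

Both boundary values are inclusive, so a score exactly equal to a threshold is never left unidentified.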
- Expanding the subject term set means adding newly selected subject terms to the subject term set.
- The selection rules for new subject terms can be set according to actual needs.
- For example, synonyms of the subject terms in the target article can be used as new subject terms, which is not limited in this embodiment.
- The unidentified comments in this step are the comments to be identified that have not been successfully identified.
- According to the weights of all subject terms in the expanded subject term set, including the newly added ones, the similarity between each unsuccessfully identified comment to be identified and the target article can be calculated,
- so as to complete the identification of all unsuccessfully identified comments to be identified. If not all of them can be identified after the subject term set has been expanded once, the subject term set is expanded again, until all the comments to be identified are successfully identified.
- The technical solution of this embodiment calculates the similarity between each comment to be identified and the target article using the calculation weight of each of the plurality of subject terms in the subject term set, and identifies candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation results. If it is determined that some comments to be identified have not been successfully identified, the subject term set is updated and those comments are identified again,
- until the categories of all comments to be identified have been determined. In this way, multiple rounds of automatic identification of spam comments in Internet comment information are realized, and the identification effect for spam comments is improved.
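The multi-round procedure can be sketched as a loop. This is a sketch under stated assumptions, not the implementation of this embodiment: the similarity function and the subject-term update function are passed in as hypothetical callables, the thresholds reuse the example values of 5% and 90%, and `max_rounds` is an added safety cap that the text does not mention.

```python
from typing import Callable, Dict, List, Tuple

Weights = Dict[str, float]


def identify_in_rounds(
    comments: List[str],
    weights: Weights,
    similarity: Callable[[str, Weights], float],
    update_terms: Callable[[Weights, List[str]], Weights],
    low: float = 0.05,
    high: float = 0.90,
    max_rounds: int = 10,  # assumption: guard against comments that never resolve
) -> Tuple[List[str], List[str], List[str]]:
    """Return (candidate_spam, normal, still_unidentified)."""
    candidate_spam: List[str] = []
    normal: List[str] = []
    pending = list(comments)
    for _ in range(max_rounds):
        if not pending:
            break
        still_pending: List[str] = []
        for comment in pending:
            score = similarity(comment, weights)
            if score >= high:
                normal.append(comment)
            elif score <= low:
                candidate_spam.append(comment)
            else:
                still_pending.append(comment)
        pending = still_pending
        if pending:
            # Expand or re-weight the subject terms from the normal comments
            # found so far, then retry only the unidentified comments.
            weights = update_terms(weights, normal)
    return candidate_spam, normal, pending
```

Only the comments left unidentified are carried into the next round, which mirrors the step of updating the comments to be identified to those not yet successfully identified.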
- Fig. 2 is the flow chart of the method for identifying spam comments provided by the second embodiment of the present application.
- This embodiment refines the above embodiment, wherein obtaining a plurality of comments to be identified corresponding to the target article includes: obtaining all comments corresponding to the target article, and matching each comment against a lexicon of common Internet words; obtaining candidate spam comments, candidate normal comments and unidentifiable comments according to the matching results; and determining the candidate normal comments and the unidentifiable comments as the plurality of comments to be identified.
- There are multiple identified candidate spam comments; after all the comments to be identified are successfully identified, the method further includes: filtering the identified candidate spam comments, and identifying each candidate spam comment as a spam comment or a normal comment according to the filtering results.
- The method includes the following steps:
- Common Internet words are conventional characters, words or phrases that appear frequently on the Internet, for example, words such as "top", "refueling", "support", "sofa", "boredom", "soy sauce", "occupying a seat" and "pouring water". The lexicon of common Internet words is a lexicon containing such words.
- The length L of each comment is calculated, and a threshold T is set to evaluate the length of the comment; for example, T may take a value between 5 and 8. When L is less than T, the comment is a short comment, and the set of short comments is defined as ShorD; when L is greater than or equal to T, the comment is a non-short comment.
- Because short comments containing common Internet words basically do not contain words related to the content of the target article, identifying the category of such short comments by text similarity is bound to perform poorly. Therefore, in this embodiment, with regard to comment length, short comments are first identified using the lexicon of common Internet words, and the comments that cannot be identified in this way are then identified by text similarity, so that spam comments can be identified regardless of whether a comment is short or non-short.
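The pre-screening of short comments can be illustrated as follows. The exact matching rule is not spelled out above, so this sketch makes several labeled assumptions: comment length is measured in whitespace-separated tokens, a short comment containing at least one common Internet word becomes a candidate spam comment, and every other comment is passed on to the similarity stage. The function name `prescreen` and the default `t=6` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def prescreen(
    comments: Iterable[str], common_words: Set[str], t: int = 6
) -> Tuple[List[str], List[str]]:
    """Split comments into (candidate_spam, to_identify)."""
    candidate_spam: List[str] = []
    to_identify: List[str] = []
    for comment in comments:
        tokens = comment.lower().split()
        is_short = len(tokens) < t  # L < T means a short comment
        # Assumed rule: short comment with a common Internet word -> candidate spam.
        if is_short and any(tok in common_words for tok in tokens):
            candidate_spam.append(comment)
        else:
            to_identify.append(comment)
    return candidate_spam, to_identify
```

For text in languages without whitespace word boundaries, a word segmentation step would have to replace `split()`.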
- S260: Return to calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of unsuccessfully identified comments to be identified and the target article, until all the comments to be identified are successfully identified.
- In this embodiment, candidate spam comments, candidate normal comments and unidentifiable comments are obtained by matching against the lexicon of common Internet words, and the candidate normal comments and unidentifiable comments
- are determined as the comments to be identified. The comments to be identified are identified using the text similarity calculation method, and after all of them have been successfully identified, secondary filtering is performed on the identified candidate spam comments. In this way spam comments can be identified in both short and non-short comments, automatic identification of spam comments in Internet comment information is realized, and the identification effect for spam comments is improved.
- Secondary filtering means using common Internet words and subject terms to filter the candidate spam comments again, by comparing the proportion of normal words and the proportion of spam words in the total vocabulary of each candidate spam comment.
- When the proportion of normal words in the total vocabulary of a candidate spam comment is greater than or equal to a threshold, the comment is considered to be a normal comment; when the proportion of normal words in the total vocabulary of a candidate spam comment is less than the threshold, the comment is considered to be a spam comment. This reduces the possibility of normal comments being identified as spam comments.
- The threshold can be set according to actual requirements, which is not limited in this embodiment.
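A minimal sketch of the secondary filtering rule follows. It assumes that "normal words" are supplied as a set (for example, the subject terms), that comments are tokenized on whitespace, and that the threshold is 0.5; all three are illustrative choices, and `secondary_filter` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple


def secondary_filter(
    candidates: Iterable[str], normal_words: Set[str], threshold: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split candidate spam comments into (spam, normal)."""
    spam: List[str] = []
    normal: List[str] = []
    for comment in candidates:
        tokens = comment.lower().split()
        # Proportion of normal words in the comment's total vocabulary.
        ratio = (
            sum(1 for tok in tokens if tok in normal_words) / len(tokens)
            if tokens
            else 0.0
        )
        if ratio >= threshold:
            normal.append(comment)
        else:
            spam.append(comment)
    return spam, normal
```

An empty comment has no normal words and is therefore kept as spam under this rule.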
- Fig. 3a is a flow chart of the spam comment identification method provided by the third embodiment of the present application
- Fig. 3b is an overall block diagram of the spam comment identification method provided by the third embodiment of the present application. This embodiment is based on the above-mentioned embodiment.
- Identifying the candidate spam comments and normal comments among the plurality of comments to be identified includes: obtaining the similarity calculation result corresponding to each comment to be identified; if the similarity calculation result corresponding to a comment to be identified is less than or equal to a first threshold, determining that the comment is a candidate spam comment; if the similarity calculation result is greater than or equal to a second threshold, determining that the comment is a normal comment; and if the similarity calculation result is greater than the first threshold and less than the second threshold, determining that the comment has not been successfully identified.
- the method includes the following steps:
- The calculation weight of each subject term includes the weight of each subject term in the target article. Calculating the similarity between each comment to be identified and the target article P according to the calculation weight of each of the plurality of subject terms in the subject term set includes:
- C_k represents the vector of the k-th comment to be identified;
- P represents the vector of the target article;
- n is the dimension of the vector;
- w_i represents the weight of subject term i in the target article;
- w_ik represents the weight of subject term i in the k-th comment to be identified;
- S_i represents the semantic information between words; in the first round of similarity calculation between the comments to be identified and the target article,
- S_i is 1, and in the remaining rounds
- Sim(P_i, C_i'k) indicates the similarity score between subject term i in the target article and its synonym i' in the k-th comment to be identified; if they are the same word, the value is 1;
- LenP is the number of subject terms in the target article;
- Same(P, C_k) is the number of subject terms of the target article, or synonyms of those subject terms, that appear in the k-th comment to be identified. Because the value is a number not greater
- n is the dimension of the vector C_k of the k-th comment to be identified and of the vector P of the target article, and n is numerically equal to the number of subject terms.
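As a hedged illustration of how the quantities defined above could fit together, the sketch below computes a cosine similarity between the weight vectors in which each product is scaled by S_i. Treating Same(P, C_k) / LenP as an optional multiplicative coverage factor is purely an assumption made for this example. This is an illustrative reading, not the formula of this embodiment, and `weighted_cosine` is a hypothetical name.

```python
import math
from typing import Optional, Sequence


def weighted_cosine(
    w_article: Sequence[float],            # w_i: weights in the target article P
    w_comment: Sequence[float],            # w_ik: weights in the k-th comment C_k
    s: Optional[Sequence[float]] = None,   # S_i: semantic factor, 1 in round one
    same: Optional[int] = None,            # Same(P, C_k)
    len_p: Optional[int] = None,           # LenP
) -> float:
    n = len(w_article)
    if len(w_comment) != n:
        raise ValueError("both vectors must have dimension n")
    if s is None:
        s = [1.0] * n  # first round: S_i = 1 for every subject term
    dot = sum(wi * wik * si for wi, wik, si in zip(w_article, w_comment, s))
    norm = math.sqrt(sum(wi * wi for wi in w_article)) * math.sqrt(
        sum(wik * wik for wik in w_comment)
    )
    if norm == 0.0:
        return 0.0  # a comment with no subject terms has zero similarity
    score = dot / norm
    if same is not None and len_p:
        score *= same / len_p  # assumed coverage factor, at most 1
    return score
```

With all S_i equal to 1 and no coverage factor, the function reduces to the ordinary cosine similarity of the two weight vectors.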
- This embodiment proposes an improved cosine similarity formula, based on the similarity between words, word position information and
- morphological similarity, as the method for calculating the similarity between the comment to be identified and the target article.
- The improved formula is as follows:
- The first threshold is a preset value used to determine that a comment to be identified is a candidate spam comment.
- The first threshold may be set according to actual requirements, which is not limited in this embodiment.
- The second threshold is a preset value used to determine that a comment to be identified is a normal comment.
- The second threshold may be set according to actual requirements, which is not limited in this embodiment.
- For example, the first threshold is set to a value b and the second threshold is set to a value a. If the similarity calculation result corresponding to a comment to be identified is less than or equal to b, the comment is determined to be a candidate spam comment; if the result is greater than or equal to a, the comment is determined to be a normal comment; and if the result is greater than b and less than a, the comment is determined to have not been successfully identified.
- Expanding the subject term set includes: obtaining the high-frequency words included in the identified normal comments, adding the high-frequency words to the subject term set as new subject terms, and setting weights for the newly added subject terms; and counting how frequently each subject term in the new subject term set occurs in the target article, matching against a co-occurrence word lexicon according to the occurrence frequency to obtain co-occurrence words associated with the occurrence frequency of at least one subject term in the new subject term set, adding those co-occurrence words to the subject term set as new subject terms, and setting weights for the newly added subject terms.
- Weight'(t_r) is the adjusted weight of a high-frequency word; t_r is a word that appears in normal comments; T(t_r) is the weight of word t_r in normal comments, calculated as 1+log10(1+n_k); T(k) is the number of normal comments including word t_r after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison.
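The first half of the expansion step, adding high-frequency words from normal comments, might look like the sketch below. It assumes whitespace tokenization and defines "high-frequency" by a hypothetical `min_count` cutoff; new terms receive the weight T(t_r) = 1 + log10(1 + n_k) given above, while any further adjustment to Weight'(t_r) and the co-occurrence lexicon lookup are left out.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def expand_with_high_frequency(
    weights: Dict[str, float], normal_comments: Iterable[str], min_count: int = 2
) -> Dict[str, float]:
    """Return a new subject term set extended with high-frequency words."""
    counts = Counter(
        tok for comment in normal_comments for tok in comment.lower().split()
    )
    expanded = dict(weights)  # existing subject terms keep their weights
    for term, n_k in counts.items():
        if n_k >= min_count and term not in expanded:
            expanded[term] = 1.0 + math.log10(1.0 + n_k)
    return expanded
```

The input dictionary is copied rather than modified, so each round can compare the subject term set before and after expansion.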
- Updating the calculation weights of the plurality of subject terms in the subject term set includes:
- updating the calculation weight of each subject term in the subject term set.
- Weight'(i) is the updated calculation weight of word i;
- word i is a subject term in the subject term set of the target article;
- the synonym i' is a synonym of word i appearing in normal comments;
- n_p is the number of times word i appears in the target article;
- n_k is the number of times word i appears in normal comments after K rounds of similarity comparison;
- T(k) is the number of normal comments including word i after K rounds of similarity comparison;
- N(k) is the total number of normal comments after K rounds of similarity comparison;
- N_i' is the set of synonyms of word i;
- Weight(i') is the weight of the synonym i' of word i in all normal comments;
- Sim(i, i') is the similarity score between word i and synonym i';
- the adjustment factor is greater than 0 and adjusts the degree to which the weight and similarity of each synonym in the synonym set of word i influence word i;
- 1+log(1+n_p+n_k) represents the word frequency of the word.
- 1 is added inside the logarithm so that the logarithm does not evaluate to zero: n_p may well be 0 while
- n_k is 1, in which case log(n_p+n_k) is 0. In addition, since the value of the logarithm is generally less than 1, which would reduce the value of the whole formula and might adversely affect subsequent comment classification, 1 is also added in front of the logarithm in this embodiment.
- In the formula, T(k)/N(k) indicates the ratio of the number of normal comments in which word i appears to the total number of normal comments.
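The components described above can be combined in a short sketch. The way they are combined here (word-frequency term multiplied by the ratio T(k)/N(k), plus the adjustment factor times the sum of Weight(i') multiplied by Sim(i, i') over the synonyms) is an assumption made for illustration, as is the use of base-10 logarithms; `updated_weight` and its parameter names are hypothetical.

```python
import math
from typing import Iterable, Tuple


def updated_weight(
    n_p: int,          # occurrences of word i in the target article
    n_k: int,          # occurrences of word i in normal comments so far
    t_k: int,          # T(k): normal comments containing word i
    n_total: int,      # N(k): total normal comments
    synonyms: Iterable[Tuple[float, float]] = (),  # (Weight(i'), Sim(i, i')) pairs
    adjust: float = 1.0,  # adjustment factor, must be greater than 0
) -> float:
    if n_total <= 0:
        raise ValueError("N(k) must be positive")
    if adjust <= 0:
        raise ValueError("adjustment factor must be greater than 0")
    term_frequency = 1.0 + math.log10(1.0 + n_p + n_k)
    coverage = t_k / n_total  # share of normal comments containing word i
    synonym_part = adjust * sum(w * sim for w, sim in synonyms)
    # Assumed combination of the three described components.
    return term_frequency * coverage + synonym_part
```

For example, a word absent from the article (n_p = 0) that appears 9 times across 5 of 10 normal comments gets (1 + 1) × 0.5 = 1.0 before any synonym contribution.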
- The calculation weight of the synonym i' of word i may be adjusted according to the following formula, so as to adjust the calculation weight of subject term i.
- i' is a synonym, appearing in the comments, of word i in the subject term set;
- T(i') is the weight of word i' in normal comments, calculated as 1+log10(1+n_k);
- T(k) is the number of normal comments containing word i' after K rounds of similarity comparison;
- N(k) is the total number of normal comments after K rounds of similarity comparison;
- T(i_p) is the weight of the synonym i_p of word i', where i_p is a word in the subject term set before the weight adjustment;
- N_p is the set of synonyms of word i', obtained from the subject term set of the target article.
- By continuously adjusting the calculation weights of the subject terms, the above technical solution makes the identification results for the comments to be identified more accurate and reliable.
- In this embodiment, candidate spam comments, candidate normal comments and unidentifiable comments are obtained by matching against the lexicon of common Internet words, and the candidate normal comments and unidentifiable comments
- are determined as the comments to be identified. On the basis of the first threshold and the second threshold, the comments to be identified are identified using the text similarity calculation method, and after all of them have been successfully identified, secondary filtering is performed on the candidate spam comments. In this way spam comments can be identified in both short and non-short comments, automatic identification of spam comments in Internet comment information is realized, and the identification effect for spam comments is improved.
- FIG. 4 is a schematic structural diagram of an apparatus for identifying spam comments provided in Embodiment 4 of the present application.
- the apparatus can implement a method for identifying spam comments involved in the above-mentioned embodiments.
- This device can be implemented in the form of software and/or hardware.
- the identification device of the spam comments includes: similarity calculation module 410, comment identification module 420, subject term set update module 430, comment success identification module 440.
- the similarity calculation module 410 is configured to obtain a plurality of comments to be identified and keyword sets corresponding to the target article, and according to the calculation weight of each keyword in the plurality of keywords in the keyword set, calculate the relationship between each comment to be identified and The similarity between the target articles; the comment identification module 420 is set to identify alternative spam comments and normal comments in the plurality of comments to be identified according to the similarity calculation result; the subject term set update module 430 is set to if It is determined that there are unidentified comments to be identified among the plurality of comments to be identified, and then according to the identified normal comments, at least one of the following operations is performed on the set of subject terms to obtain a new set of subject terms as the Subject term set: expand the subject term set, and update the calculation weights of the plurality of subject terms in the subject term set, and update the plurality of comments to be identified as the unsuccessful Identify the comment to be identified; the comment success identification module 440 is configured to return to execute the calculation weight of each keyword in the subject word set, and calculate the weight between each of
- the technical solution of the embodiment of the present application calculates the similarity between each comment to be identified and the target article by using the calculation weight of each keyword in the keyword set, and according to the similarity calculation result, among multiple comments to be identified Identify alternative spam comments and normal comments. If it is determined that there are unidentified unrecognized comments, after updating the subject word set, re-identify unrecognized unrecognized comments until all unrecognized comments are identified categories, which can realize multiple rounds of automatic identification of spam comments in Internet comment information, and improve the identification effect of spam comments.
- the similarity calculation module 410 is configured to obtain all comments corresponding to the target article, and match each comment with the dictionary of commonly used words in the network; obtain alternative spam comments, alternative normal comments and The comment cannot be identified, and the candidate normal comment and the unidentified comment are determined as the plurality of comments to be identified.
- the number of identified alternative spam comments is multiple; the identification device for spam comments also includes a secondary filtering module, which is configured to filter the identified multiple comments after all comments to be identified are successfully identified. Each candidate spam comment is filtered, and each identified candidate spam comment is identified as a spam comment or a normal comment according to the filtering results.
- the comment identification module 420 is configured to obtain a similarity calculation result corresponding to each comment to be identified; if it is determined that the similarity calculation result corresponding to each comment to be identified is less than or equal to the first threshold, then determine Each comment to be identified is an alternative spam comment; if it is determined that the similarity calculation result corresponding to each comment to be identified is greater than or equal to a second threshold, then it is determined that each comment to be identified is a normal comment; if it is determined If the similarity calculation result of each comment to be identified is greater than the first threshold and less than the second threshold, then it is determined that each comment to be identified has not been successfully identified.
- the calculation weight of each subject term includes: the weight of each subject term in the target article; the similarity calculation module 410 is set to use the formula Calculate the similarity between each comment to be identified and the target article; among them, C k represents the vector of the kth comment to be recognized, P represents the vector of the target article, n is the dimension of the vector, and w i represents the subject word i in The weight in the target article, w ik represents the weight of the topic word i in the k comment to be recognized, and S i represents the semantic information between words.
- the subject term set update module 430 is configured to: obtain the high-frequency words included in the identified normal comments, add them to the subject term set as new subject terms, and set weights for the newly added subject terms; and count how often each subject term in the new subject term set occurs in the target article, match a co-occurrence word thesaurus according to these frequencies to obtain the co-occurrence words associated with at least one subject term in the new subject term set, add those co-occurrence words to the subject term set as new subject terms, and set weights for them.
- the subject term set update module 430 is configured to update the calculation weight of each subject term in the subject term set using a formula not reproduced here, where Weight'(i) is the updated calculation weight of word i; word i is a subject term in the subject term set of the target article; the synonym i' is a synonym of word i appearing in normal comments; n_p is the number of times word i appears in the target article; n_k is the number of times word i appears in normal comments after K rounds of similarity comparison; T(k) is the number of normal comments that include word i after K rounds of similarity comparison; N(k) is the total number of normal comments after K rounds of similarity comparison; N_i' is the set of synonyms of word i; Weight(i') is the weight of synonym i' of word i across all normal comments; Sim(i, i') is the similarity score between word i and synonym i'; and the formula's adjustment factor is greater than 0.
- the spam comment identification device provided in the embodiment of the present application can execute the spam comment identification method provided in any embodiment of the present application, and has corresponding functional modules for executing the method.
- FIG. 5 is a schematic structural diagram of a computer device provided in Embodiment 5 of the present application.
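- the computer device includes a processor 510, a memory 520, an input device 530, and an output device 540.
- the number of processors 510 in the computer device can be one or more, and one processor 510 is taken as an example in FIG. 5; the connection between the processor 510, the memory 520, the input device 530, and the output device 540 is illustrated in FIG. 5 taking a bus connection as an example.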
- the memory 520 can be used to store software programs, computer-executable programs and modules, such as the program instructions/modules corresponding to the method for identifying spam comments in the embodiments of the present application (for example, the modules of the device for identifying spam comments).
- the processor 510 executes the various functional applications and data processing of the computer device by running the software programs, instructions and modules stored in the memory 520; that is, it implements the above-mentioned method for identifying spam comments.
- the method includes: obtaining a plurality of comments to be identified and a subject term set corresponding to the target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set; identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result; if it is determined that the plurality of comments to be identified includes comments that have not been successfully identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set as the subject term set: expanding the subject terms of the subject term set, and updating the calculation weights of the plurality of subject terms in the subject term set; updating the plurality of comments to be identified to the comments that have not been successfully identified; and returning to the operation of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified are successfully identified.
- the memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required by at least one function; the data storage area may store data created according to the use of the terminal, and the like.
- the memory 520 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage devices.
- the memory 520 may further include memory located remotely from the processor 510, and these remote memories may be connected to the computer device through a network. Examples of the aforementioned networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
- the input device 530 can be used to receive input numbers or character information, and generate key signal input related to user settings and function control of the computer equipment.
- the output device 540 may include a display device such as a display screen.
- Embodiment 6 of the present application also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, implement a method for identifying spam comments, the method including: obtaining a plurality of comments to be identified and a subject term set corresponding to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each of the plurality of subject terms in the subject term set; identifying candidate spam comments and normal comments among the plurality of comments to be identified according to the similarity calculation result; if it is determined that the plurality of comments to be identified includes comments that have not been successfully identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set as the subject term set: expanding the subject terms of the subject term set, and updating the calculation weights of the plurality of subject terms in the subject term set; updating the plurality of comments to be identified to the comments that have not been successfully identified; and returning to the operation of calculating, according to the calculation weight of each subject term in the subject term set, the similarity between each of the plurality of comments to be identified and the target article, until all comments to be identified are successfully identified.
- in the storage medium containing computer-executable instructions provided in the embodiments of the present application, the computer-executable instructions are not limited to the method operations described above and can also perform related operations in the spam comment identification method provided in any embodiment of the present application.
- the present application can be implemented by means of software plus the necessary general-purpose hardware, and of course it can also be implemented by hardware, but in many cases the former is the better implementation.
- the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product; the computer software product can be stored in a computer-readable storage medium, such as a floppy disk, read-only memory (ROM), random access memory (RAM), flash memory (FLASH), hard disk or optical disc, and includes several instructions that cause a computer device (which can be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments of the present application.
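
The identification loop recited above (similarity calculation, threshold classification, and subject term expansion repeated until no comment remains unidentified) can be illustrated with a minimal Python sketch. The similarity and weight-update formulas are not reproduced in this text, so the sketch assumes a plain cosine similarity over subject term weights and gives every newly added subject term a fixed weight. The function names, the threshold values, and the parameters `top_n`, `new_weight` and `max_rounds` are hypothetical choices made for illustration and are not taken from the application.

```python
from collections import Counter
from math import sqrt


def similarity(tokens, weights):
    """Cosine similarity between a comment and the article's subject terms.

    ASSUMPTION: stands in for the application's formula, which is not
    reproduced in the text. The article vector holds the subject term
    weights; the comment vector holds the counts of those terms.
    """
    counts = Counter(t for t in tokens if t in weights)
    norm_article = sqrt(sum(w * w for w in weights.values()))
    if not counts or norm_article == 0:
        return 0.0
    dot = sum(weights[t] * c for t, c in counts.items())
    norm_comment = sqrt(sum(c * c for c in counts.values()))
    return dot / (norm_article * norm_comment)


def classify(score, low, high):
    """Apply the first (low) and second (high) thresholds to one score."""
    if score <= low:
        return "candidate_spam"
    if score >= high:
        return "normal"
    return "unidentified"


def identify(comments, weights, low=0.2, high=0.6,
             top_n=3, new_weight=1.0, max_rounds=10):
    """Classify tokenized comments, expanding the subject term set each round.

    ASSUMPTIONS: new subject terms are the top_n most frequent words of the
    normal comments identified so far, each given the fixed weight
    new_weight; the loop also stops when no new term can be added or after
    max_rounds, so it may return comments that are still unidentified.
    """
    weights = dict(weights)          # do not mutate the caller's mapping
    pending = list(comments)
    spam, normal = [], []
    for _ in range(max_rounds):
        still_pending = []
        for tokens in pending:
            label = classify(similarity(tokens, weights), low, high)
            if label == "candidate_spam":
                spam.append(tokens)
            elif label == "normal":
                normal.append(tokens)
            else:
                still_pending.append(tokens)
        pending = still_pending
        if not pending:
            break
        # Subject term expansion from high-frequency words of normal comments.
        freq = Counter(t for tokens in normal for t in tokens)
        new_terms = [w for w, _ in freq.most_common(top_n) if w not in weights]
        if not new_terms:
            break                    # scores would not change; give up
        for w in new_terms:
            weights[w] = new_weight
    return spam, normal, pending
```

For instance, with the subject terms `battery`, `phone` and `screen` all weighted 1.0 and the default thresholds, a comment tokenized as `["charging", "screen"]` scores about 0.58 in the first round and stays unidentified. Once `charging` has been learned from an already identified normal comment, the second round can classify it as normal. The candidate spam comments returned here would still go through the secondary filtering described earlier, which this sketch does not cover.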
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Disclosed in the embodiments of the present invention are a method and apparatus for identifying spam comments, and a device and a medium. The method comprises: acquiring a plurality of comments to be identified and a subject term set, which correspond to a target article, and calculating the similarity between each comment to be identified and the target article according to the calculation weight of each subject term among a plurality of subject terms in the subject term set; identifying, according to a similarity calculation result, candidate spam comments and normal comments from the plurality of comments to be identified; when it is determined that there are comments to be identified that have not been successfully identified among the plurality of comments to be identified, performing, according to the identified normal comments, at least one of the following operations on the subject term set to obtain a new subject term set to serve as the subject term set: performing subject term expansion on the subject term set, and updating the calculation weights of the plurality of subject terms in the subject term set; updating the plurality of comments to be identified to the comments to be identified that have not been successfully identified; and returning to execute the step of calculating the similarity between each of the plurality of comments to be identified and the target article according to the calculation weight of each subject term in the subject term set, until all the comments to be identified are successfully identified.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110925078.4A CN113656580B (zh) | 2021-08-12 | 2021-08-12 | Method, apparatus, device and medium for identifying spam comments |
CN202110925078.4 | 2021-08-12 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2023016267A1 (fr) | 2023-02-16 |
Family
ID=78491540
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2022/108563 WO2023016267A1 (fr) | 2021-08-12 | 2022-07-28 | Method and apparatus for identifying spam comments, and device and medium |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113656580B (fr) |
WO (1) | WO2023016267A1 (fr) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113656580B (zh) * | 2021-08-12 | 2024-08-06 | 北京锐安科技有限公司 | Method, apparatus, device and medium for identifying spam comments |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8832116B1 (en) * | 2012-01-11 | 2014-09-09 | Google Inc. | Using mobile application logs to measure and maintain accuracy of business information |
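CN109783616A (zh) * | 2018-12-03 | 2019-05-21 | 广东蔚海数问大数据科技有限公司 | Text topic extraction method, system and storage medium |
CN110209795A (zh) * | 2018-06-11 | 2019-09-06 | 腾讯科技(深圳)有限公司 | Comment identification method and apparatus, computer-readable storage medium and computer device |
CN112559685A (zh) * | 2020-12-11 | 2021-03-26 | 芜湖汽车前瞻技术研究院有限公司 | Method for identifying spam comments in automobile forums |
CN113656580A (zh) * | 2021-08-12 | 2021-11-16 | 北京锐安科技有限公司 | Method, apparatus, device and medium for identifying spam comments |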
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
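CN102254038B (zh) * | 2011-08-11 | 2013-01-23 | 武汉安问科技发展有限责任公司 | System for analyzing the relevance of online comments and analysis method thereof |
CN103226576A (zh) * | 2013-04-01 | 2013-07-31 | 杭州电子科技大学 | Spam comment filtering method based on semantic similarity |
CN109902179A (zh) * | 2019-03-04 | 2019-06-18 | 上海宝尊电子商务有限公司 | Method for screening e-commerce spam comments based on natural language processing |
CN111125305A (zh) * | 2019-12-05 | 2020-05-08 | 东软集团股份有限公司 | Hot topic determination method and apparatus, storage medium and electronic device |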
2021
- 2021-08-12 CN CN202110925078.4A patent/CN113656580B/zh active Active
2022
- 2022-07-28 WO PCT/CN2022/108563 patent/WO2023016267A1/fr active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8832116B1 (en) * | 2012-01-11 | 2014-09-09 | Google Inc. | Using mobile application logs to measure and maintain accuracy of business information |
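CN110209795A (zh) * | 2018-06-11 | 2019-09-06 | 腾讯科技(深圳)有限公司 | Comment identification method and apparatus, computer-readable storage medium and computer device |
CN109783616A (zh) * | 2018-12-03 | 2019-05-21 | 广东蔚海数问大数据科技有限公司 | Text topic extraction method, system and storage medium |
CN112559685A (zh) * | 2020-12-11 | 2021-03-26 | 芜湖汽车前瞻技术研究院有限公司 | Method for identifying spam comments in automobile forums |
CN113656580A (zh) * | 2021-08-12 | 2021-11-16 | 北京锐安科技有限公司 | Method, apparatus, device and medium for identifying spam comments |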
Also Published As
Publication number | Publication date |
---|---|
CN113656580B (zh) | 2024-08-06 |
CN113656580A (zh) | 2021-11-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
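CN108509425B (zh) | Novelty-based method for discovering new Chinese words | |
CN109299480B (zh) | Context-based term translation method and apparatus | |
WO2020140373A1 (fr) | Intention recognition method, recognition device and computer-readable storage medium | |
CN112364641B (zh) | Method and apparatus for generating Chinese adversarial examples for text moderation | |
CN108763213A (zh) | Keyword extraction method for topic-feature text | |
WO2019218527A1 (fr) | Multi-system combined natural language processing method and apparatus | |
CN108009135B (zh) | Method and apparatus for generating a document summary | |
CN106909655A (zh) | Knowledge graph entity discovery and linking method based on generative alias mining | |
CN108920599B (zh) | Method for precisely locating and extracting answers in a question answering system based on a knowledge ontology base | |
CN111460170B (zh) | Word recognition method and apparatus, terminal device and storage medium | |
CN109472022B (zh) | Machine learning-based new word recognition method and terminal device | |
CN111241813B (zh) | Corpus expansion method, apparatus, device and medium | |
WO2017020454A1 (fr) | Search method and apparatus | |
CN112989235B (zh) | Knowledge base-based internal link construction method, apparatus, device and storage medium | |
WO2023016267A1 (fr) | Method and apparatus for identifying spam comments, and device and medium | |
CN106815190B (zh) | Word recognition method, apparatus and server | |
CN110287493B (zh) | Risk phrase recognition method and apparatus, electronic device and storage medium | |
CN113934848B (zh) | Data classification method and apparatus, and electronic device | |
CN109977397B (zh) | News hotspot extraction method, system and storage medium based on part-of-speech combination | |
CN113191145B (zh) | Keyword processing method and apparatus, electronic device and medium | |
CN107239455A (zh) | Core word recognition method and apparatus | |
CN110874408A (zh) | Model training method, text recognition method and apparatus, and computing device | |
CN110489759B (zh) | Word frequency-based text feature weighting and short text similarity calculation method, system and medium | |
CN112182332A (zh) | Sentiment classification method and system based on crawler collection | |
CN111966869A (zh) | Phrase extraction method and apparatus, electronic device and storage medium |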
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 22855250; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | EP: PCT application non-entry in European phase | Ref document number: 22855250; Country of ref document: EP; Kind code of ref document: A1 |