CN110019681B

CN110019681B - Comment content filtering method and system

Info

Publication number: CN110019681B
Application number: CN201711373559.9A
Authority: CN
Inventors: 杨华涛
Original assignee: Alibaba China Co Ltd
Current assignee: Alibaba China Co Ltd
Priority date: 2017-12-19
Filing date: 2017-12-19
Publication date: 2022-05-17
Anticipated expiration: 2037-12-19
Also published as: CN110019681A

Abstract

The embodiments of the application disclose a comment content filtering method and system. The method comprises: performing word segmentation on all comments of a comment subject to obtain a word sequence for each comment; determining the relevance between any two words in a comment according to the word vector corresponding to each word; determining, from the relevance between all words in the comment, the probability that the comment content is meaningful; and filtering out comments whose probability is less than or equal to a threshold. This scheme filters out randomly typed content with no practical meaning, refines and de-noises the text content, and greatly improves the efficiency of obtaining valuable comment content.

Description

Comment content filtering method and system

Technical Field

The application relates to the technical field of internet, in particular to a comment content filtering method and system.

Background

With the rapid development of internet technology, users interact over the internet in many ways. For example, a user can post a comment in the comment area below a commented subject, and other users can interact with that comment. A comment conveys information about certain characteristics of the comment subject and the user's personal feelings toward it. Users can learn about the comment subject from the comment content and can exchange information with other users about the same subject.

Currently, when comments are analyzed, the large volume of comments on a single comment subject is mixed with highly repetitive content that has no practical meaning, such as "sofa". Some comment areas even contain many meaningless, randomly typed sentences, such as: "and property appear to be the breaths of four children perhaps feeling your abdomen easy looking at your west-safe sunburn". Because of this highly repetitive and meaningless content, valuable text in the comment area is submerged, and obtaining useful comment content from the comment area is inefficient.

Disclosure of Invention

The embodiments of the application aim to provide a comment content filtering method and system suitable for filtering comments, bullet-screen comments, posts, and the like, and to solve the technical problem that randomly typed, meaningless comment content reduces the efficiency of obtaining valuable comment content.

In order to achieve the above object, an embodiment of the present application provides a comment content filtering method, including:

performing word segmentation processing on all comments of a comment main body to obtain a word sequence of the comments;

determining the relevance between any two words in the comment according to the word vector corresponding to each word in the comment, determining the meaningful probability of the comment content by using the relevance between all the words in the comment, and filtering the comment corresponding to the probability less than or equal to a threshold value.

In order to achieve the above object, an embodiment of the present application further provides a comment content filtering system, where the comment content filtering system includes: a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, performs the functions of:

Therefore, compared with the prior art, the technical scheme provided by the application filters out the content which has no practical significance and is input in disorder in the text content, achieves the purposes of refining and reducing noise of the text content, and greatly improves the efficiency of finally obtaining valuable comment content.

Drawings

To illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a comment content filtering method according to an embodiment of the present application;

Fig. 2 is a second flowchart of a comment content filtering method according to an embodiment of the present application;

Fig. 3 is a third flowchart of a comment content filtering method according to an embodiment of the present application;

Fig. 4 is a schematic diagram of a comment content filtering system according to an embodiment of the present application.

Detailed Description

In order to make those skilled in the art better understand the technical solutions in the present application, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art without any inventive work based on the embodiments in the present application shall fall within the scope of protection of the present application.

The application provides a comment content filtering method, which takes a comment subject as a scope, and performs filtering processing on all comments of the comment subject according to the steps shown in fig. 1, so as to achieve the purpose of refining and reducing noise. The method can be applied to terminal equipment with a data processing function. The terminal device may be, for example, a desktop computer, a notebook computer, a tablet computer, a workstation, etc. The method may comprise the steps of:

S11: performing word segmentation processing on all comments of the comment subject to obtain a word sequence for each comment.

In this embodiment, a word segmenter is used to segment all the comment contents. An open-source word segmenter may be selected, such as the word segmenter or the IK segmenter.

S12: determining the relevancy between any two words in the comment according to the word vector corresponding to each word in the comment, determining the meaningful probability of the comment content by using the relevancy between all words in the comment, and filtering the comment corresponding to the probability less than or equal to a threshold value.

In practice, the comment area of a comment subject contains many meaningless, randomly typed comments, for example: "buy better royal horse butterbur usa and care and never only tubercle bacillus for several years basically not good to go home and plan too many stars and avoid department according to domestic version. Can be used in case of emergency or busy. And (6) carrying out the following steps." Typically, such meaningless input is filtered using a Bayesian algorithm. Analysis shows, however, that characteristic token strings can rarely be extracted from such randomly typed content, so it is difficult to build a sample library of randomly typed meaningless sentences, and filtering with the Bayesian algorithm therefore performs poorly.

Research shows that introducing a neural network to calculate the relevance between words is very likely to solve the problem that the Bayesian algorithm cannot. In practical application, a neural network is trained on a large number of text samples to establish a recognition model. The model can identify the probability that two words appear in the same context, which is the relevance between the two words. The highest relevance is 1, meaning the two words are identical, and the lowest is 0, meaning the two words never appear together in any training context. Consequently, the probability that a complete sentence is meaningful, which is determined from the calculated relevance between all of its words, must also lie between 0 and 1. The higher the value, the more likely the sentence is meaningful; the lower the value, the more likely it is meaningless. A threshold is set to filter out meaningless sentences.
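The scoring idea above can be sketched in a few lines. The text does not say how the recognition model computes relevance or how the pairwise values are combined, so this sketch makes two assumptions: relevance is approximated by cosine similarity between word vectors, clamped to the range 0 to 1, and the meaningful probability is taken as the mean over all word pairs. The names `relevance`, `meaningful_probability`, and `TOY_VECTORS`, as well as the toy vectors themselves, are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def relevance(u: Sequence[float], v: Sequence[float]) -> float:
    """Assumed stand-in for the model's relevance: cosine similarity clamped to [0, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm == 0.0:
        return 0.0
    return min(1.0, max(0.0, dot / norm))


def meaningful_probability(words: List[str], vectors: Dict[str, Sequence[float]]) -> float:
    """Assumed aggregation: mean relevance over all word pairs of the comment.

    A pair containing a word without a vector contributes relevance 0.
    A comment with fewer than two words has no pairs; returning 1.0 for it
    is an arbitrary choice made for this sketch.
    """
    pairs = list(combinations(words, 2))
    if not pairs:
        return 1.0
    total = 0.0
    for first, second in pairs:
        if first in vectors and second in vectors:
            total += relevance(vectors[first], vectors[second])
    return total / len(pairs)


# Made-up three-dimensional vectors; a trained model would supply real ones.
TOY_VECTORS = {
    "actor": [0.9, 0.1, 0.0],
    "drama": [0.8, 0.2, 0.1],
    "plot": [0.7, 0.3, 0.0],
    "bacillus": [0.0, 0.1, 0.9],
}

coherent = meaningful_probability(["actor", "drama", "plot"], TOY_VECTORS)
scrambled = meaningful_probability(["actor", "bacillus", "plot"], TOY_VECTORS)
print(coherent > scrambled)  # the off-topic word lowers the score
```

Averaging keeps the result between 0 and 1, as the description requires, but other aggregations (for example a minimum or a learned layer) would satisfy the same constraint.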

In application, each word sequence of the comment content to be filtered is input into the recognition model. If words unrelated to the comment subject appear in a comment, their contexts differ from those of the other words, so the relevance between the words decreases. For example, when terms such as "royal horse" and "tubercle bacillus" appear in a comment on a video, their contexts differ from those of the terms normally used to comment on that video, so their relevance to the other terms in the comment is reduced, and the probability that the sentence is meaningful, obtained from the relevance between all the terms of that sentence, is reduced as well. In practice, such comments may have been deliberately typed at random by the user, so this scheme filters them out.

Based on the above description, in this embodiment, all comments of the comment subject are segmented to obtain a word sequence for each comment. The words in the word sequence are then converted into word vectors, which are used as the input of the recognition model. The model computes the relevance between the words and determines from it the probability that the sentence is meaningful, and comments whose probability is less than or equal to the threshold are filtered out. For example, for the sentence "I am a man and you are a woman", the word vector of each word after segmentation is used as the input of the recognition model, which yields a meaningful probability of 0.71428573, greater than the set threshold. The larger the probability, the better the sentence conforms to Chinese word order; such a sentence has practical meaning and is not a randomly typed meaningless sentence. In actual operation, the threshold is set according to actual conditions.
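The thresholding step can be illustrated with a minimal sketch. The recognition model is represented here by a plain scoring callable, since its internals are not given; the token lists, the threshold of 0.5, and the score of 0.12 are made up for illustration, and only 0.71428573 comes from the example in the text. The function name `filter_comments` is hypothetical.

```python
from typing import Callable, Dict, List


def filter_comments(
    word_sequences: Dict[str, List[str]],
    score: Callable[[List[str]], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Drop every comment whose meaningful probability is <= threshold."""
    return {
        comment_id: words
        for comment_id, words in word_sequences.items()
        if score(words) > threshold
    }


# Hypothetical scores standing in for the recognition model's output.
HYPOTHETICAL_SCORES = {
    ("I", "am", "a", "man", "and", "you", "are", "a", "woman"): 0.71428573,
    ("buy", "royal", "horse", "tubercle", "bacillus"): 0.12,  # made-up value
}


def lookup_score(words: List[str]) -> float:
    """Return the stored score for a word sequence, or 0.0 if unknown."""
    return HYPOTHETICAL_SCORES.get(tuple(words), 0.0)


comments = {
    "c1": ["I", "am", "a", "man", "and", "you", "are", "a", "woman"],
    "c2": ["buy", "royal", "horse", "tubercle", "bacillus"],
}
kept = filter_comments(comments, lookup_score, threshold=0.5)
print(sorted(kept))
```

The comparison is strict so that a comment scoring exactly at the threshold is removed, matching the "less than or equal to" wording of step S12.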

Fig. 2 is a flowchart of another comment content filtering method proposed in the embodiment of the present application. On the basis of fig. 1, the filtering method further includes:

S13: filtering again, according to the high-frequency word bank, the word sequences of the comments remaining after the filtering process.

In this embodiment, there are many comments on the same comment subject. Take "Hua Qian Gu", played on the Youku platform, as an example. In the video comment area of "Hua Qian Gu", some comments read "sofa", and others read "Zhao Liying's acting in Hua Qian Gu is great, refuel! Zhao Liying." Terms such as "sofa" and "refuel" have no relation to the video "Hua Qian Gu", and the term "refuel" occurs particularly often. After word segmentation, the comment "sofa" is filtered out, and the word "refuel" is deleted from the other comment.

In this embodiment, the high-frequency word bank can be obtained by segmenting a massive sample of comment data from different comment subjects and then screening the word statistics. Words in the word bank have no practical meaning with respect to the comment subject and appear in the text content at high frequency, so they are noise to be filtered out. Specifically, taking Youku video as an example, the high-frequency word bank can be built by randomly drawing more than one million comments from Youku's site-wide comment database, performing word frequency statistics on the segmented comment texts, and then setting a word frequency threshold to obtain the high-frequency words that have no practical meaning with respect to the comment subject. The word frequency threshold can be adjusted dynamically according to the statistical results. For example, words such as "sofa", "advertisement", "rubbish", "refuel", and "thank you" all belong to the high-frequency word bank, are irrelevant to the comment subject, and are noise in the comment content.
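A rough sketch of how such a word bank might be assembled follows. The text does not explain how a word is judged to have no practical meaning for the comment subject, so the sketch assumes the caller supplies a set of subject-related words to exclude. The function name `build_high_frequency_bank`, the sample data, and the threshold value are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def build_high_frequency_bank(
    sample_sequences: Iterable[List[str]],
    frequency_threshold: int,
    subject_related: Set[str],
) -> Set[str]:
    """Collect words whose count exceeds the threshold and that are not subject-related."""
    counts = Counter(word for sequence in sample_sequences for word in sequence)
    return {
        word
        for word, count in counts.items()
        if count > frequency_threshold and word not in subject_related
    }


# Tiny made-up sample; the description draws on more than one million comments.
samples = [
    ["sofa"],
    ["Zhao Liying", "acting", "refuel"],
    ["refuel", "Zhao Liying"],
    ["sofa"],
    ["refuel"],
]
bank = build_high_frequency_bank(samples, frequency_threshold=1, subject_related={"Zhao Liying"})
print(sorted(bank))
```

Because the threshold is an ordinary parameter, it can be re-tuned whenever the frequency statistics change, in line with the dynamic adjustment mentioned above.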

In this embodiment, the word sequence remaining after high-frequency word bank filtering may further be matched against a stop-word bank; if the comment content includes a stop word, the stop word is filtered out of the comment's word sequence. In practice, stop words can be regarded as a special class of high-frequency words. The stop-word bank includes numbers, letters, punctuation marks, emoji, function words, and the like. In this embodiment, the stop-word bank may be user-defined or obtained from an open-source word bank. Common open-source word segmenters currently ship with a stop-word bank.
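The two matching passes of this second filtering stage can be sketched as plain set lookups. This is an illustrative sketch only: the function names `strip_noise` and `refilter`, the word banks, and the tokenized comments are hypothetical, and dropping a comment whose word sequence becomes empty is this sketch's reading of how the "sofa" comment ends up being filtered out.

```python
from typing import Dict, List, Set


def strip_noise(words: List[str], high_frequency_bank: Set[str], stop_words: Set[str]) -> List[str]:
    """Remove high-frequency words first, then stop words, preserving word order."""
    remaining = [word for word in words if word not in high_frequency_bank]
    return [word for word in remaining if word not in stop_words]


def refilter(
    comments: Dict[str, List[str]],
    high_frequency_bank: Set[str],
    stop_words: Set[str],
) -> Dict[str, List[str]]:
    """Clean every word sequence and drop comments that end up with no words."""
    cleaned = {
        comment_id: strip_noise(words, high_frequency_bank, stop_words)
        for comment_id, words in comments.items()
    }
    return {comment_id: words for comment_id, words in cleaned.items() if words}


# Hypothetical banks and tokenized comments for illustration.
high_frequency_bank = {"sofa", "refuel"}
stop_words = {"in", "!", ","}
comments = {
    "c1": ["sofa"],
    "c2": ["Zhao Liying", "in", "Hua Qian Gu", "acting", "great", ",", "refuel", "!"],
}
result = refilter(comments, high_frequency_bank, stop_words)
print(result)
```

Keeping both banks as sets makes each lookup constant time, which matters when the same banks are applied to every comment of a subject.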

Fig. 3 shows a third flowchart of a comment content filtering method provided in the embodiment of the present application. On the basis of fig. 2, the filtering method further includes:

S14: counting the words remaining in the word sequence after high-frequency word bank filtering, and updating the high-frequency word bank according to the counting result.

In this embodiment, again taking the television drama "Hua Qian Gu" played on Youku as an example, suppose a comment reads: "Zhao Liying's acting in Hua Qian Gu is great, refuel! Zhao Liying, love you!" After word segmentation, the word sequence is { Zhao Liying, in …, Hua Qian Gu, acting, great, refuel, Zhao Liying, love you }. Matching the word sequence against the high-frequency word bank finds the high-frequency word "refuel"; deleting it from the word sequence yields the personalized word set { Zhao Liying, in …, Hua Qian Gu, acting, great, Zhao Liying, love you }, which is then counted. After the word sequences of the comments on "Hua Qian Gu" have been counted many times, the words "Zhao Liying" and "love you" both show very high word frequency. "Zhao Liying" is related to the comment subject, whereas "love you" has no actual relation to it, so "love you" is added to the high-frequency word bank. The words in the high-frequency word bank are updated according to actual conditions so that high-frequency words with no practical meaning for the comment subject are filtered accurately.
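The update step S14 can be sketched as a feedback loop over the personalized word sets. As before, the text does not say how relatedness to the comment subject is decided, so the sketch assumes a caller-supplied set of subject-related words; the function name `update_bank`, the word sets, and the threshold are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Set


def update_bank(
    bank: Set[str],
    personalized_sets: Iterable[List[str]],
    frequency_threshold: int,
    subject_related: Set[str],
) -> Set[str]:
    """Return a new bank extended with frequent leftover words unrelated to the subject."""
    counts = Counter(word for words in personalized_sets for word in words)
    additions = {
        word
        for word, count in counts.items()
        if count > frequency_threshold and word not in subject_related
    }
    return bank | additions


# Made-up personalized word sets left over after high-frequency filtering.
personalized_sets = [
    ["Zhao Liying", "Hua Qian Gu", "acting", "great", "Zhao Liying", "love you"],
    ["Zhao Liying", "love you"],
    ["love you", "plot"],
]
updated = update_bank(
    bank={"sofa", "refuel"},
    personalized_sets=personalized_sets,
    frequency_threshold=2,
    subject_related={"Zhao Liying", "Hua Qian Gu"},
)
print(sorted(updated))
```

Returning a new set rather than mutating the input keeps the previous bank available, which is convenient if an update needs to be reviewed before it is used for filtering.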

Referring to fig. 4, the present application further provides a comment content filtering system. The system comprises: a memory a and a processor b, wherein the memory a stores a computer program, and the computer program realizes the following functions when being executed by the processor b:

In this embodiment, when executed by the processor, the computer program further implements the following functions:

filtering again, according to the high-frequency word bank, the word sequences of the comments remaining after the filtering process;

counting the words remaining in the word sequence after high-frequency word bank filtering, and updating the high-frequency word bank according to the counting result;

after high-frequency word bank filtering, matching the remaining words in the word sequence against a stop-word bank, and filtering stop words out of the comment's word sequence according to the matching result, wherein the stop-word bank is obtained from an open-source word bank or is user-defined.

In this embodiment, to obtain the high-frequency word bank, the computer program, when executed by the processor, implements the following function:

when word frequency statistics are performed on the words of sample comment content of the comment subject, forming the high-frequency word bank from words whose frequency exceeds the word frequency threshold and that have no practical meaning with respect to the comment subject.

In this embodiment, to filter again, according to the high-frequency word bank, the word sequences of the comments remaining after the filtering process, the computer program, when executed by the processor, implements the following function:

matching the word sequences of the comments remaining after the filtering process against the high-frequency word bank, and filtering high-frequency words out of the word sequences according to the matching result.

In this embodiment, the Memory includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard Disk Drive (HDD), or a Memory Card (Memory Card).

In this embodiment, the processor may be implemented in any suitable manner. For example, the processor may take the form of, for example, a microprocessor or processor and a computer-readable medium that stores computer-readable program code (e.g., software or firmware) executable by the (micro) processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, an embedded microcontroller, and so forth.

The specific functions implemented by the memory and the processor of the comment content filtering system provided in the embodiment of the present specification may be explained in comparison with the foregoing embodiments in the present specification, and can achieve the technical effects of the foregoing embodiments, and thus, no further description is given here.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (for example, an improvement to a circuit structure such as a diode, transistor, or switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's method flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be realized with hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Furthermore, instead of manually making integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must be written in a specific programming language called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained merely by programming the method flow in one of the hardware description languages above and programming it into an integrated circuit.

Those skilled in the art will also appreciate that, in addition to implementing a client, server as pure computer readable program code, the same functionality can be implemented by logically programming method steps such that the client, server are in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a client and a server may be regarded as a hardware component, and a device included therein for implementing various functions may be regarded as a structure within the hardware component. Or even means for performing the functions may be regarded as being both a software module for performing the method and a structure within a hardware component.

From the above description of the embodiments, it is clear to those skilled in the art that the present application can be implemented by software plus necessary general hardware platform. Based on such understanding, the technical solutions of the present application may be essentially or partially implemented in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present application.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on differences from other embodiments. In particular, for the embodiments of the client, reference may be made to the introduction of the embodiments of the method described above for a comparative explanation.

The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Although the present application has been described in terms of embodiments, those of ordinary skill in the art will recognize that there are numerous variations and permutations of the present application without departing from the spirit of the application, and it is intended that the appended claims encompass such variations and permutations without departing from the spirit of the application.

Claims

1. A comment content filtering method, comprising:

determining the relevance between any two words in the comment according to a word vector corresponding to each word in the comment, determining the meaningful probability of the comment content by using the relevance between all the words in the comment, and filtering the comment corresponding to the probability less than or equal to a threshold value, wherein the relevance is used for representing the probability that two words appear in the same context, and the more different the contexts between the words are, the smaller the relevance of the words is;

and filtering again, according to a high-frequency word bank, the word sequences of the comments remaining after the filtering process, wherein the high-frequency words in the high-frequency word bank are words that, when word frequency statistics are performed on the words of sample comment content of the comment subject, exceed a word frequency threshold and have no practical meaning with respect to the comment subject.

2. The method of claim 1, further comprising:

3. The method of claim 1, further comprising:

4. The method of claim 1, wherein the filtering again of the word sequences of the comments left after the filtering process according to the high frequency thesaurus is performed by:

matching the word sequences of the remaining comments after the filtering processing with the high-frequency word bank, and filtering out high-frequency words from the word sequences according to the matching result.

5. The method of claim 3, wherein the stop words comprise numbers, letters, punctuation marks, emoji, and function words.

6. A comment content filtering system, the system comprising: a memory and a processor, the memory having stored therein a computer program that, when executed by the processor, performs the functions of:

filtering the word sequences of the remaining comments after the filtering process again according to the high-frequency word bank;

obtaining the high-frequency word bank, wherein, when word frequency statistics are performed on the words of sample comment content of the comment subject, words whose frequency exceeds the word frequency threshold and that have no practical meaning with respect to the comment subject form the high-frequency word bank.

7. The system of claim 6, wherein the computer program, when executed by the processor, further performs the functions of:

8. The system of claim 6, wherein the computer program, when executed by the processor, further performs the functions of:

9. The system of claim 6, wherein the filtering process is performed again on sequences of terms of the comments remaining after the filtering process according to a high frequency thesaurus, and wherein the computer program, when executed by the processor, performs the following functions: