WO2021169499A1

WO2021169499A1 - Network bad data monitoring method, apparatus and system, and storage medium

Info

Publication number: WO2021169499A1
Application number: PCT/CN2020/136403
Authority: WO
Inventors: 张国辉; 钱柏丞
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-02-26
Filing date: 2020-12-15
Publication date: 2021-09-02
Also published as: CN111400439A

Abstract

A network bad data monitoring method and apparatus, and a computer readable storage medium. The method comprises: performing word segmentation on a target text (S110); comparing words in a word segmentation set with a preset bad vocabulary comparison table, screening out bad words from the word segmentation set, and loading the bad words into a first bad vocabulary list (S120); by means of a word similarity calculation formula, calculating a mean similarity of each word to be selected, and loading the word to be selected the mean similarity of which is greater than a preset similarity threshold into the first bad vocabulary list (S130); screening out words that do not satisfy a preset bad word emotion tendency rule by using a sentiment analysis algorithm (S140); and screening out words that do not conform to a bad vocabulary sentence position structure by means of a word position structure method (S150). The method can more accurately discover unregistered bad vocabulary, and in comparison with the existing technology, the precision and accuracy of recorded bad vocabulary are higher.

Description

Network bad data monitoring method, device, system and storage medium

This application claims priority to Chinese patent application No. 202010119614.7, filed with the Chinese Patent Office on February 26, 2020 and entitled "Network Bad Data Monitoring Method, Device, and Storage Medium", the entire content of which is incorporated herein by reference.

Technical field

This application relates to the technical field of big data processing, and in particular to a method, device and computer-readable storage medium for monitoring network bad data.

Background

With the rapid development of the Internet, the era of information explosion has already arrived. As the main carrier of Internet information dissemination, Internet text has also been developed rapidly. Internet languages are changing with each passing day. At the same time, the vulgarization of Internet languages is becoming more and more serious. The monitoring and discovery of malicious Internet vocabulary is facing great challenges.

With the popularity of the Internet, online forums, online articles, and online media continue to appear, a large amount of text is produced every day, and a large number of bad words circulate on the Internet. The inventor realized that the greatest difficulty in monitoring bad network vocabulary lies in the fast update speed of network language and in vocabulary changes that are diverse and follow no obvious regularity. Many detection models have no automatic recognition function for unregistered words, or rely only on simple similarity calculations between words to collect them. As a result, over time either more and more unregistered words are never included in the system, or the quality of the unregistered words that are included becomes worse and worse. This reduces the accuracy and effectiveness of existing monitoring models, so that unregistered bad vocabulary cannot be found accurately.

Summary of the invention

In view of the above problems in the prior art, this application provides a method, device, and computer-readable storage medium for monitoring network bad data. After word segmentation, each word is compared with the bad words in a preset bad vocabulary comparison table, and words identical to a recorded bad word are loaded into a first bad vocabulary list. Because the bad words in the comparison table are limited, the target text may contain bad words that are merely similar to the recorded ones; the remaining words are therefore evaluated with a word similarity calculation formula, and those exceeding a preset similarity threshold are also loaded into the first bad vocabulary list. Since words found by similarity calculation are not necessarily bad words, a sentiment analysis algorithm and a word position structure method then screen the non-bad words out of the first bad vocabulary list, and a third bad vocabulary list is finally output. In this way unregistered bad vocabulary can be found more accurately, and the recorded bad vocabulary is more precise and accurate than in the prior art.

In the first aspect, in order to achieve the above objective, this application provides a method for monitoring network bad data, which includes:

Perform word segmentation processing on the target text to obtain a word segmentation set;

Compare the words in the word segmentation set with a preset bad vocabulary comparison table, filter out bad words from the word segmentation set, load the bad words into the first bad vocabulary list, and use the words remaining in the word segmentation set after screening as candidate words;

Through the word similarity calculation formula, calculate the average similarity between each candidate word and the words in the preset bad vocabulary comparison table, and load the candidate words whose average similarity is greater than the preset similarity threshold into the first bad vocabulary list;

Through the sentiment analysis algorithm, screen out of the first bad vocabulary list the words that do not meet the preset bad word emotional tendency rule, to obtain the second bad vocabulary list;

Through the word position structure method, screen out of the second bad vocabulary list the words that do not conform to the bad vocabulary sentence position structure, and obtain and output the third bad vocabulary list.

In a second aspect, in order to achieve the above objective, this application also provides an electronic device comprising a memory and a processor, wherein a network bad data monitoring program is stored in the memory, and when the network bad data monitoring program is executed by the processor, the steps of the method for monitoring network bad data described in the first aspect are implemented.

In a third aspect, in order to achieve the above objective, this application also provides a network bad data monitoring system, including:

The word segmentation processing unit is used to perform word segmentation processing on the target text to obtain a word segmentation set;

The bad word screening unit is used to compare words in the word segmentation set with a preset bad vocabulary comparison table, filter out bad words from the word segmentation set, load the bad words into the first bad vocabulary list, and use the words remaining in the word segmentation set after screening as candidate words;

The word similarity calculation unit is used to calculate the average similarity between each of the candidate words and the words in the preset bad vocabulary comparison table through a word similarity calculation formula, and to load the candidate words whose average similarity is greater than the preset similarity threshold into the first bad vocabulary list;

The sentiment analysis unit is used to screen out of the first bad vocabulary list, through a sentiment analysis algorithm, the words that do not meet the preset bad word emotional tendency rule, to obtain the second bad vocabulary list;

The word position structure screening unit is used to filter out words that do not conform to the position structure of the bad vocabulary sentence from the second bad vocabulary list through the word position structure method to obtain and output the third bad vocabulary list.

In a fourth aspect, in order to achieve the above objective, this application also provides a computer-readable storage medium in which a network bad data monitoring program is stored, and when the network bad data monitoring program is executed by a processor, any step of the method for monitoring network bad data described above is implemented.

With the network bad data monitoring method, device, and computer-readable storage medium proposed in this application, the target text is segmented, each word is compared with the bad words in the preset bad vocabulary comparison table, and words identical to recorded bad words are loaded into the first bad vocabulary list. Because the bad words in the comparison table are limited, the target text may contain bad words that are merely similar to the recorded ones; the remaining words are therefore evaluated with the word similarity calculation formula, and those exceeding the preset similarity threshold are loaded into the first bad vocabulary list. Since words found by similarity calculation are not necessarily bad words, the sentiment analysis algorithm and the word position structure method screen the non-bad words out of the first bad vocabulary list, and the third bad vocabulary list is finally output. Unregistered bad vocabulary can thus be found more accurately, and the recorded bad vocabulary is more precise and accurate than in the prior art.

Description of the drawings

FIG. 1 is a flowchart of a preferred embodiment of a method for monitoring bad network data according to this application;

FIG. 2 is a schematic diagram of an application environment of a preferred embodiment of a method for monitoring bad network data according to this application;

FIG. 3 is a schematic diagram of modules of a preferred embodiment of the network bad data monitoring program in FIG. 2;

FIG. 4 is a system logic diagram corresponding to the method for monitoring bad network data in this application.

The realization of the purpose, the functional characteristics, and the advantages of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described here are only used to explain the present application, and are not used to limit the present application.

Embodiment 1

The present application provides a method for monitoring bad network data. Refer to FIG. 1, which is a flowchart of a preferred embodiment of the method for monitoring bad network data according to this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.

In this embodiment, the method for monitoring network bad data includes steps S110 to S150.

Step S110: Perform word segmentation processing on the target text to obtain a word segmentation set.

With the popularization of the Internet, the amount of online text keeps growing. To maintain order on the Internet, bad words in online text usually need to be monitored. To check whether an online article contains bad words, the target article must first be segmented into words; word segmentation is the basic step of text sentiment analysis. In the prior art, commonly used Chinese word segmentation tools include jieba, HanLP, pynlpir, ansj, LTP, and thulac. Performing word segmentation on the target text yields a word segmentation set.
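As a rough illustration of what step S110 produces, the sketch below segments a short Chinese sentence with forward maximum matching. This is a deliberately simple stand-in, not the algorithm used by any of the tools named above; the function name `segment`, the `max_len` parameter, and the toy dictionary are assumptions made for the example.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching: at each position, take the longest dictionary word.

    Stand-in for a real segmentation tool; single characters are the fallback.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest span first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"你", "真是", "个", "智障"}
segmentation_set = segment("你真是个智障", toy_dictionary)
```

In practice the word segmentation set would come from one of the dedicated tools, which also supply the part-of-speech tags used later in step S150.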

Step S120: Compare the words in the word segmentation set with a preset bad vocabulary comparison table, filter out bad words from the word segmentation set, load the bad words into the first bad vocabulary list, and use the words remaining in the word segmentation set after screening as candidate words.

Specifically, each word in the word segmentation set is compared with the bad words in the preset bad vocabulary comparison table, which stores a large number of bad words. The comparison identifies the bad words in the word segmentation set; these are filtered out of the set and loaded into the first bad vocabulary list.

The bad words in the preset bad vocabulary comparison table can be drawn from bad words common on the Internet. When the words in the word segmentation set are compared with the preset bad vocabulary comparison table, a word that is exactly the same as a bad word in the table is selected from the word segmentation set and loaded into the first bad vocabulary list. For example, if the word "mentally retarded" exists in the word segmentation set and is also recorded in the preset bad vocabulary comparison table, "mentally retarded" is filtered out of the word segmentation set and recorded in the first bad vocabulary list.

The step of comparing the words in the word segmentation set with the preset bad vocabulary comparison table, filtering out bad words from the word segmentation set, loading the bad words into the first bad vocabulary list, and using the remaining words as candidate words includes:

Input the words in the word segmentation set and the preset bad vocabulary comparison table into the preset same word screening model, and filter out bad words from the word segmentation set through the preset same word screening model;

Load bad words into the first bad vocabulary list, and use the remaining words filtered in the word segmentation set as candidate words.

Specifically, the preset same word screening model includes:

A first input layer for inputting the words in the word segmentation set; a second input layer for inputting the preset bad vocabulary comparison table; a same word filtering layer for comparing the words input through the first input layer with the preset bad vocabulary comparison table input through the second input layer; a first output layer for outputting the bad words that the same word filtering layer filters out of the word segmentation set; and a second output layer for outputting the words remaining in the word segmentation set after the bad words are filtered out.
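The two-input, two-output structure of the same word screening model can be pictured as a single function, as in the minimal sketch below. The function name `screen_exact_matches` and the sample words are assumptions for illustration; the text does not specify how the model is implemented.

```python
def screen_exact_matches(segmentation_set, bad_vocabulary_table):
    """Split the segmented words into exact-match bad words and candidate words.

    The two arguments play the role of the two input layers; the two returned
    lists play the role of the two output layers.
    """
    recorded = set(bad_vocabulary_table)
    first_bad_list = [word for word in segmentation_set if word in recorded]
    candidate_words = [word for word in segmentation_set if word not in recorded]
    return first_bad_list, candidate_words


first_bad_list, candidate_words = screen_exact_matches(
    ["you", "mentally retarded", "obstacle"],
    ["mentally retarded"],
)
```

Using a set for the recorded bad words keeps each lookup cheap even when the comparison table is large.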

Step S130: Calculate the average similarity between each candidate word and the words in the preset bad vocabulary comparison table through the word similarity calculation formula, and load the candidate words whose average similarity is greater than the preset similarity threshold into the first bad vocabulary list.

Because the preset bad vocabulary comparison table contains only bad words that have already been recorded, its coverage is limited. If the word segmentation set contains bad words that are not recorded in the table, exact comparison alone does not screen the set thoroughly. The word similarity calculation formula is therefore used to pick out, from the remaining words in the word segmentation set, words that are similar to the bad words in the preset bad vocabulary comparison table. For example, the remaining words may include a variant of "mentally retarded" that is not recorded in the table, while "mentally retarded" itself is recorded. The similarity between each remaining word and the recorded bad words is calculated and compared with the preset similarity threshold, and words similar to bad words are finally screened out of the remaining words.

Among them, the steps of calculating the mean value of the similarity between each candidate word and the words in the predetermined bad vocabulary comparison table through the word similarity calculation formula include:

Vectorize each word to be selected to obtain the word vector of the word to be selected;

Calculate, with the word similarity calculation formula, the similarity between the word vector of each candidate word and each bad word vector in the preset bad word vector set, to obtain N similarity values, where the preset bad word vector set is the set of word vectors obtained by vectorizing the words in the preset bad vocabulary comparison table;

According to the N similarity values, obtain the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table.

Obtaining the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table according to the N similarity values includes:

The N similarity values are added and processed to obtain the total similarity value; where N is the number of words in the preset bad vocabulary comparison table;

Divide the total value of similarity by N to obtain the mean value of similarity between the candidate words and the words in the predetermined bad vocabulary comparison table.

Specifically, the words remaining in the word segmentation set after screening are used as candidate words, and each candidate word is vectorized to obtain its word vector. The words in the preset bad vocabulary comparison table are vectorized in advance to obtain the preset bad word vector set. Taking any one candidate word as an example, the similarity between its word vector and each bad word vector in the preset bad word vector set is calculated with the word similarity calculation formula, giving N similarity values, where N is the number of words in the preset bad vocabulary comparison table. The N similarity values are then summed and averaged to give the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table. The mean similarity of every candidate word is calculated in this way.

In the word similarity calculation formula, W1 is the word vector of the candidate word, W2 is any word vector in the preset bad word vector set, n is the word vector dimension, W1_i is the value of W1 in the i-th dimension, and W2_i is the value of W2 in the i-th dimension.

The similarity threshold is preset, and the candidate words whose mean similarity meets the preset similarity threshold are selected from the remaining words in the word segmentation set and loaded into the first bad vocabulary list.
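The averaging and thresholding in step S130 can be sketched as follows. The similarity formula itself does not appear in this text, so cosine similarity is used here as an assumption that is consistent with the symbols W1, W2, n, W1_i, and W2_i; the function names, toy vectors, and the threshold value 0.6 are likewise illustrative only.

```python
import math


def cosine_similarity(w1, w2):
    """Cosine similarity over n dimensions (assumed form of the similarity formula)."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm = math.sqrt(sum(a * a for a in w1)) * math.sqrt(sum(b * b for b in w2))
    return dot / norm if norm else 0.0


def mean_similarity(candidate_vector, bad_word_vectors):
    """Sum the N similarity values and divide by N."""
    total = sum(cosine_similarity(candidate_vector, v) for v in bad_word_vectors)
    return total / len(bad_word_vectors)


def extend_first_bad_list(first_bad_list, candidate_vectors, bad_word_vectors, threshold):
    """Append every candidate whose mean similarity exceeds the threshold."""
    extended = list(first_bad_list)
    for word, vector in candidate_vectors.items():
        if mean_similarity(vector, bad_word_vectors) > threshold:
            extended.append(word)
    return extended


# Toy 2-dimensional vectors, for illustration only.
bad_word_vectors = [[1.0, 0.0], [0.0, 1.0]]
candidate_vectors = {"near": [1.0, 1.0], "far": [-1.0, -1.0]}
result = extend_first_bad_list([], candidate_vectors, bad_word_vectors, threshold=0.6)
```

Real word vectors would come from a trained embedding model, and the threshold would be tuned on labeled data.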

Step S140: Through the sentiment analysis algorithm, screen out of the first bad vocabulary list the words that do not meet the preset bad word emotional tendency rule, to obtain the second bad vocabulary list.

Because the words selected by similarity may include non-bad words, the words in the first bad vocabulary list need to be screened again. Through the sentiment analysis algorithm (the SO-PMI algorithm), words that do not satisfy the preset bad word emotional tendency rule are screened out of the first bad vocabulary list. The SO-PMI algorithm is based on pointwise mutual information and is used to calculate the emotional tendency intensity value of a word.

The step of screening out of the first bad vocabulary list, through the sentiment analysis algorithm, the words that do not meet the preset bad word emotional tendency rule, to obtain the second bad vocabulary list, includes:

Perform vectorization processing on the words in the first bad vocabulary list to obtain the word vectors to be calculated;

Using the word co-occurrence frequency calculation formula, calculate the word co-occurrence frequency between each word vector to be calculated and the word vectors in the pre-built civilized vocabulary, and the word co-occurrence frequency between each word vector to be calculated and the word vectors in the pre-built uncivilized vocabulary, as the word co-occurrence frequencies to be used;

According to the word co-occurrence frequencies to be used, calculate the emotional tendency intensity value of each word in the first bad vocabulary list through the sentiment analysis calculation formula;

Compare the emotional tendency intensity value of each word in the first bad vocabulary list with the preset emotional tendency intensity threshold rule, and, according to that rule, screen out of the first bad vocabulary list the words that do not meet the preset bad word emotional tendency rule, to obtain the second bad vocabulary list.

In the word co-occurrence frequency calculation formula, F(N1, N2) is the frequency with which N1 and N2 appear together within a window of a set size across all n articles, and F(N1) and F(N2) are the frequencies with which N1 and N2, respectively, appear across all n articles.
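A minimal sketch of a window-based co-occurrence calculation is given below. The formula is not reproduced in this text, so the standard pointwise mutual information, log2(F(N1, N2) / (F(N1) · F(N2))), is assumed. The function names, the default window size, the choice to return 0.0 when two words never co-occur, and the toy articles are all assumptions for illustration.

```python
import math


def window_frequencies(articles, n1, n2, window=5):
    """Return (F(N1, N2), F(N1), F(N2)) as fractions of sliding windows over all articles."""
    total = both = with_n1 = with_n2 = 0
    for tokens in articles:
        # Slide a fixed-size window over each tokenized article.
        for start in range(max(1, len(tokens) - window + 1)):
            span = tokens[start:start + window]
            has_n1, has_n2 = n1 in span, n2 in span
            total += 1
            with_n1 += has_n1
            with_n2 += has_n2
            both += has_n1 and has_n2
    if total == 0:
        return 0.0, 0.0, 0.0
    return both / total, with_n1 / total, with_n2 / total


def pmi(articles, n1, n2, window=5):
    """Pointwise mutual information of two words (assumed standard form)."""
    f12, f1, f2 = window_frequencies(articles, n1, n2, window)
    if f12 == 0:
        return 0.0  # Assumption: treat words that never co-occur as unassociated.
    return math.log2(f12 / (f1 * f2))


# Toy corpus of four tokenized "articles", for illustration only.
articles = [
    ["active", "optimistic"],
    ["active", "optimistic"],
    ["frustrated", "tired"],
    ["frustrated", "tired"],
]
pmi_active_optimistic = pmi(articles, "active", "optimistic")
pmi_active_frustrated = pmi(articles, "active", "frustrated")
```

A production system would usually smooth the counts rather than return 0.0 for unseen pairs; the simple fallback here only keeps the example short.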

In the sentiment analysis calculation formula, Q is a word in the first bad vocabulary list, Cwords is the pre-built civilized vocabulary, Iwords is the pre-built uncivilized vocabulary, PMI(Q, cword) is the co-occurrence frequency between the word in the first bad vocabulary list and a word vector in the pre-built civilized vocabulary, PMI(Q, Iword) is the co-occurrence frequency between the word in the first bad vocabulary list and a word vector in the pre-built uncivilized vocabulary, and SO-PMI(Q) is the emotional tendency intensity value of the word Q in the first bad vocabulary list.

Preferably, the emotional tendency intensity threshold rule is:

If the emotional tendency intensity value of the words in the first bad vocabulary list is greater than or equal to zero, the word is a word that does not meet the preset bad words emotional tendency rule;

If the emotional tendency intensity value of the words in the first bad vocabulary list is less than zero, the words are words that meet the preset bad words emotional tendency rules.
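The threshold rule can be sketched as below. Since the sentiment analysis formula is not reproduced in this text, the usual SO-PMI form is assumed: the sum of PMI(Q, cword) over Cwords minus the sum of PMI(Q, Iword) over Iwords, which agrees with the sign convention of the rule above. The function names, the `pmi` callable, the seed words, and the toy PMI values are hypothetical.

```python
def so_pmi(word, civilized_words, uncivilized_words, pmi):
    """Emotional tendency intensity: association with civilized seeds minus uncivilized seeds."""
    civilized_score = sum(pmi(word, seed) for seed in civilized_words)
    uncivilized_score = sum(pmi(word, seed) for seed in uncivilized_words)
    return civilized_score - uncivilized_score


def second_bad_vocabulary(first_bad_list, civilized_words, uncivilized_words, pmi):
    """Keep only words whose intensity is below zero, per the threshold rule."""
    return [
        word
        for word in first_bad_list
        if so_pmi(word, civilized_words, uncivilized_words, pmi) < 0
    ]


# Hypothetical PMI values, for illustration only.
toy_pmi_table = {
    ("mentally retarded", "optimistic"): 0.1,
    ("mentally retarded", "stupid"): 1.5,
    ("obstacle", "optimistic"): 0.8,
    ("obstacle", "stupid"): 0.2,
}


def toy_pmi(word, seed):
    return toy_pmi_table.get((word, seed), 0.0)


second_list = second_bad_vocabulary(
    ["mentally retarded", "obstacle"], ["optimistic"], ["stupid"], toy_pmi
)
```

Passing the PMI function as an argument keeps the rule independent of how co-occurrence is actually counted.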

Specifically, the sentiment analysis algorithm determines the polarity of words by mining a large-scale corpus: the polarity of an unregistered word is judged from how often it co-occurs with existing words whose polarity has already been determined. Word co-occurrence means that two words appear together within a word window of a certain size.

For example, we generally describe a person as both "active" and "optimistic", but rarely as both "active" and "frustrated". This reflects the pointwise mutual information between two words, that is, their degree of association, expressed as the PMI value (word co-occurrence value). PMI is the pointwise mutual information between two random variables.

Determining word polarity with the sentiment analysis algorithm requires seed vocabularies. The pre-built uncivilized vocabulary serves as the bad seed vocabulary and contains the same number of words as the pre-built civilized vocabulary. The emotional tendency intensity value of a word w is then calculated with the sentiment analysis calculation formula, and the likelihood that the word is a bad word is judged according to the emotional tendency intensity threshold rule.

Step S150: Through the word position structure method, screen out of the second bad vocabulary list the words that do not conform to the bad vocabulary sentence position structure, and obtain and output the third bad vocabulary list.

To further remove the non-bad words from the second bad vocabulary list, its words are screened once more: through the word position structure method, words that do not conform to the bad vocabulary sentence position structure are screened out of the second bad vocabulary list, yielding the third bad vocabulary list.

Among them, through the word position structure method, the steps of screening words that do not conform to the position structure of the bad vocabulary sentence from the second bad vocabulary include:

Compare the words in the second bad vocabulary list with the sentence position structure where bad words in the pre-built bad vocabulary sentence template are located;

From the second bad vocabulary list, words that do not conform to the sentence position structure of the bad vocabulary in the bad vocabulary sentence template are filtered out to obtain the third bad vocabulary list.

Specifically, calculating word similarity between two words w1 and w2 often introduces "impurities": two words with high similarity that do not express the same meaning. For example, the similarity between "mentally retarded" and "obstacle" is as high as 0.5324, yet "mentally retarded" is clearly an uncivilized word while "obstacle" is a neutral word. To reduce such impurities, a method for judging the position structure of words is designed. For example, an uncivilized user may describe a person by saying "you are really mentally retarded", but nobody describes a person by saying "you are really an obstacle". Judging word position structure is therefore a good supplement to word similarity calculation. A sentence such as "You are really a good person" has the same pattern but completely different semantics; however, the similarity between "good person" and "mentally retarded" is very low, and the position structure judgment is applied only to words that already have high similarity to uncivilized words, so this situation is generally not encountered.

The method relies on the part-of-speech tagging and syntactic analysis functions provided by the word segmentation tool. Take "You are really mentally retarded" as an example. Its part-of-speech tagging is: you (r)/really are (d)/a (q)/mentally retarded (n). This sentence pattern and part-of-speech structure can be recorded as a template. The part-of-speech tagging of "this is an obstacle" is: this (r)/is (v)/an (q)/obstacle (n). The words "obstacle" and "mentally retarded" are thus used differently in syntactic structure and part-of-speech structure, so such impurities can be reduced by applying the collected lexical, syntactic, and part-of-speech structure templates.
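A minimal sketch of the template check follows, assuming each sentence arrives already tagged as (word, tag) pairs from the segmentation tool. The single r/d/q/n template with the bad word in the last slot is taken from the example above; the data layout, the function names, and the rule that one matching sentence is enough to keep a word are assumptions.

```python
# Each template is (tag pattern, index of the slot where the bad word sits).
TEMPLATES = [(("r", "d", "q", "n"), 3)]


def fits_bad_word_template(tagged_sentence, word, templates=TEMPLATES):
    """True if the sentence's tag sequence equals a template and the word fills its slot."""
    tags = tuple(tag for _, tag in tagged_sentence)
    for pattern, slot in templates:
        if tags == pattern and tagged_sentence[slot][0] == word:
            return True
    return False


def third_bad_vocabulary(second_bad_list, tagged_sentences, templates=TEMPLATES):
    """Keep words that appear in at least one sentence matching a bad word template."""
    return [
        word
        for word in second_bad_list
        if any(fits_bad_word_template(s, word, templates) for s in tagged_sentences)
    ]


tagged_sentences = [
    [("you", "r"), ("really are", "d"), ("a", "q"), ("mentally retarded", "n")],
    [("this", "r"), ("is", "v"), ("an", "q"), ("obstacle", "n")],
]
third_list = third_bad_vocabulary(["mentally retarded", "obstacle"], tagged_sentences)
```

Here "obstacle" is dropped because its sentence has a verb (v) where the template expects an adverb (d), mirroring the contrast drawn in the text.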

Embodiment 2

This application provides a method for monitoring bad network data, which is applied to an electronic device 1. Referring to FIG. 2, it is a schematic diagram of the application environment of the preferred embodiment of the method for monitoring network bad data according to the present application.

In this embodiment, the electronic device 1 may be a terminal device with a computing function such as a server, a smart phone, a tablet computer, a portable computer, a desktop computer, and the like.

The electronic device 1 includes a processor 12, a memory 11, a network interface 13, and a communication bus 14.

The memory 11 includes at least one type of readable storage medium, which may be a non-volatile storage medium such as flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 1, for example, the hard disk of the electronic device 1. In other embodiments, the readable storage medium may be an external memory of the electronic device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device 1.

In this embodiment, the readable storage medium of the memory 11 is generally used to store the network bad data monitoring program 10 installed in the electronic device 1, a preset bad word comparison table, and the like. The memory 11 can also be used to temporarily store data that has been output or will be output.

In some embodiments, the processor 12 may be a central processing unit (CPU), a microprocessor, or another data processing chip, which is used to run program code or process data stored in the memory 11, for example, to execute the network bad data monitoring program 10.

The network interface 13 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 1 and other electronic devices.

The communication bus 14 is used to realize the connection and communication between the above-mentioned components.

FIG. 2 only shows the electronic device 1 with the components 11-14, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.

In the device embodiment shown in FIG. 2, the memory 11, as a computer storage medium, may include an operating system and the network bad data monitoring program 10. When the processor 12 executes the network bad data monitoring program 10 stored in the memory 11, the steps of the method for monitoring bad network data in Embodiment 1 are implemented, for example as shown in FIG. 1. Alternatively, the processor 12 implements the functions of the modules/units in the foregoing device embodiments when executing the network bad data monitoring program 10. For example, the network bad data monitoring program 10 shown in FIG. 3 can be divided into a word segmentation processing module 110, a bad word screening module 120, a word similarity calculation module 130, a sentiment analysis module 140, and a word position structure screening module 150.

The functions or operation steps implemented by the modules 110-150 are similar to those described above and are not detailed again here. For example:

Word segmentation processing module 110: used to perform word segmentation processing on the target text to obtain a word segmentation set.

Bad word screening module 120: used to compare words in the word segmentation set with a preset bad vocabulary comparison table, filter out bad words from the word segmentation set, load the bad words into the first bad vocabulary list, and use the words remaining in the word segmentation set after screening as candidate words.

Word similarity calculation module 130: used to calculate the average similarity between each candidate word and the words in the preset bad vocabulary comparison table through the word similarity calculation formula, and to load the candidate words whose average similarity is greater than the preset similarity threshold into the first bad vocabulary list.

Sentiment analysis module 140: used to filter out words that do not meet the preset emotional tendency rule of bad words from the first bad vocabulary through the sentiment analysis algorithm to obtain the second bad vocabulary.

Word position structure screening module 150: used to filter out words that do not conform to the position structure of the bad vocabulary sentence from the second bad vocabulary list through the word position structure method to obtain and output the third bad vocabulary list.

Embodiment 3

Corresponding to the above method, an embodiment of the present application also proposes a network bad data monitoring system 400, which includes a word segmentation processing unit 410, a bad word screening unit 420, a word similarity calculation unit 430, a sentiment analysis unit 440, and a word position structure screening unit 450. The functions implemented by these units correspond one-to-one to the steps of the network bad data monitoring method in the embodiment.

The word segmentation processing unit 410 is configured to perform word segmentation processing on the target text to obtain a word segmentation set;

The bad word screening unit 420 is used to compare the words in the word segmentation set with a preset bad vocabulary comparison table, identify the bad words in the word segmentation set, load the bad words into the first bad vocabulary list, and take the words remaining in the word segmentation set after screening as candidate words;

The word similarity calculation unit 430 is used to calculate, through a word similarity calculation formula, the average similarity between each candidate word and the words in the preset bad vocabulary comparison table, and to load the candidate words whose average similarity is greater than the preset similarity threshold into the first bad vocabulary list;

The sentiment analysis unit 440 is configured to remove, through the sentiment analysis algorithm, the words that do not satisfy the preset bad word emotional tendency rule from the first bad vocabulary list, to obtain the second bad vocabulary list;

The word position structure screening unit 450 is used to remove, through the word position structure method, the words that do not conform to the bad vocabulary sentence position structure from the second bad vocabulary list, to obtain and output the third bad vocabulary list.
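
The way the five units hand their results to one another can be illustrated with a minimal sketch. It is not the implementation of this application: the function name `monitor_bad_words` and all parameter names are hypothetical, and the segmentation, similarity, sentiment and position checks are passed in as callables because their details are described elsewhere.

```python
from typing import Callable, List, Set


def monitor_bad_words(
    target_text: str,
    segment: Callable[[str], List[str]],
    bad_table: Set[str],
    mean_similarity: Callable[[str], float],
    similarity_threshold: float,
    satisfies_sentiment_rule: Callable[[str], bool],
    fits_position_structure: Callable[[str], bool],
) -> List[str]:
    """Hypothetical end-to-end sketch of units 410 to 450."""
    # Unit 410: word segmentation of the target text.
    words = segment(target_text)

    # Unit 420: words found in the comparison table go to the first list;
    # the remaining words become candidate words.
    first_list = [w for w in words if w in bad_table]
    candidates = [w for w in words if w not in bad_table]

    # Unit 430: candidates whose mean similarity exceeds the threshold
    # are also loaded into the first list.
    first_list += [w for w in candidates if mean_similarity(w) > similarity_threshold]

    # Unit 440: drop words that fail the emotional tendency rule.
    second_list = [w for w in first_list if satisfies_sentiment_rule(w)]

    # Unit 450: drop words that do not fit the sentence position structure.
    third_list = [w for w in second_list if fits_position_structure(w)]
    return third_list
```

Each stage only narrows or extends a list, so any single stage can be replaced without touching the others.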

Example 4

The embodiment of the present application also proposes a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium includes a network bad data monitoring program, and when the network bad data monitoring program is executed by a processor, the network bad data monitoring method in Embodiment 1 is implemented; to avoid repetition, the details are not repeated here. Alternatively, when the computer program is executed by the processor, the functions of the modules/units of the network bad data monitoring system in Embodiment 3 are realized; to avoid repetition, the details are not repeated here.

The specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementation of the foregoing network bad data monitoring method, electronic device, and system, and will not be repeated here.

It should be noted that in this document, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, device, article, or method. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.

The serial numbers of the foregoing embodiments of the present application are for description only and do not represent the superiority or inferiority of the embodiments. Through the description of the above implementation manners, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disks, or optical disks) and includes several instructions that cause a terminal device (which can be a mobile phone, a computer, a server, or a network device, etc.) to execute the method described in each embodiment of the present application.

The above are only preferred embodiments of this application and do not limit its patent scope. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

A method for monitoring bad network data, applied to an electronic device, wherein the method includes:

Performing word segmentation processing on a target text to obtain a word segmentation set;

Comparing the words in the word segmentation set with a preset bad vocabulary comparison table, identifying the bad words in the word segmentation set, loading the bad words into a first bad vocabulary list, and taking the words remaining in the word segmentation set after screening as candidate words;

Calculating, through a word similarity calculation formula, the average similarity between each candidate word and the words in the preset bad vocabulary comparison table, and loading the candidate words whose average similarity is greater than a preset similarity threshold into the first bad vocabulary list;

Removing, through a sentiment analysis algorithm, the words that do not satisfy a preset bad word emotional tendency rule from the first bad vocabulary list to obtain a second bad vocabulary list;

Removing, through a word position structure method, the words that do not conform to the bad vocabulary sentence position structure from the second bad vocabulary list, and obtaining and outputting a third bad vocabulary list.
The method for monitoring network bad data according to claim 1, wherein the step of calculating the mean similarity between each candidate word and the words in the preset bad word comparison table by using a word similarity calculation formula comprises:

Performing vectorization processing on each candidate word to obtain a word vector of the candidate word;

Calculating, through the word similarity calculation formula, the similarity between the word vector of each candidate word and each bad word vector in a preset bad word vector set to obtain N similarity values, where the preset bad word vector set is the word vector set obtained by vectorizing the words in the preset bad vocabulary comparison table;

Obtaining, according to the N similarity values, the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table.
The method for monitoring network bad data according to claim 2, wherein said obtaining, according to the N similarity values, the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table comprises:

Summing the N similarity values to obtain a total similarity value, where N is the number of words in the preset bad vocabulary comparison table;

Dividing the total similarity value by N to obtain the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table.
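
The sum-then-divide step can be sketched as follows. The names `mean_similarity` and `cosine_similarity` are hypothetical, and cosine similarity is used only as an assumed stand-in for the word similarity calculation formula; any pairwise similarity function can be passed in instead.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def cosine_similarity(w1: Vector, w2: Vector) -> float:
    """Assumed stand-in for the word similarity calculation formula."""
    dot = sum(a * b for a, b in zip(w1, w2))
    norm1 = math.sqrt(sum(a * a for a in w1))
    norm2 = math.sqrt(sum(b * b for b in w2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm1 * norm2)


def mean_similarity(
    candidate_vec: Vector,
    bad_word_vecs: Sequence[Vector],
    similarity: Callable[[Vector, Vector], float] = cosine_similarity,
) -> float:
    """Sum the N pairwise similarity values and divide the total by N."""
    n = len(bad_word_vecs)
    if n == 0:
        raise ValueError("the bad word vector set must not be empty")
    total = sum(similarity(candidate_vec, v) for v in bad_word_vecs)
    return total / n
```

A candidate word would then be loaded into the first bad vocabulary list when the returned mean is greater than the preset similarity threshold.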
The method for monitoring network bad data according to claim 1, wherein the word similarity calculation formula is:

Among them, W1 is the word vector of the candidate word, W2 is any word vector in the preset bad word vector set, n is the word vector dimension, W1_i is the value of W1 in the i-th dimension, and W2_i is the value of W2 in the i-th dimension.
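
The formula itself is not reproduced in this text. One formula that is consistent with the symbols defined above is cosine similarity, shown here as an assumption rather than as the claimed formula; the name sim is introduced only for illustration.

```latex
\[
\mathrm{sim}(W1, W2) =
\frac{\sum_{i=1}^{n} W1_{i}\, W2_{i}}
     {\sqrt{\sum_{i=1}^{n} W1_{i}^{2}}\;\sqrt{\sum_{i=1}^{n} W2_{i}^{2}}}
\]
```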
The method for monitoring network bad data according to claim 1, wherein the step of removing, through a sentiment analysis algorithm, the words that do not satisfy the preset bad word emotional tendency rule from the first bad vocabulary list to obtain the second bad vocabulary list comprises:

Performing vectorization processing on the words in the first bad vocabulary list to obtain word vectors to be calculated;

Calculating, through a word co-occurrence frequency calculation formula, the co-occurrence frequency between each word vector to be calculated and the word vectors in a pre-built civilized vocabulary library, and the co-occurrence frequency between each word vector to be calculated and the word vectors in a pre-built uncivilized vocabulary library, and taking these as the word co-occurrence frequencies to be used;

Calculating, according to the word co-occurrence frequencies to be used, the emotional tendency intensity value of each word in the first bad vocabulary list through a sentiment analysis calculation formula;

Comparing the emotional tendency intensity value of each word in the first bad vocabulary list against a preset emotional tendency intensity threshold rule, and removing from the first bad vocabulary list, according to that rule, the words that do not satisfy the preset bad word emotional tendency rule, to obtain the second bad vocabulary list.
The method for monitoring network bad data according to claim 5, wherein the formula for calculating the word co-occurrence frequency is:

Among them, F(N1, N2) refers to the frequency at which N1 and N2 appear together within a window of a set size across all n articles, and F(N1) and F(N2) refer to the frequencies at which N1 and N2, respectively, appear across all n articles.
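
The formula itself is not reproduced in this text. A conventional pointwise mutual information form that uses exactly the three quantities defined above is shown below as an assumption; the logarithm base and the absence of any smoothing term are choices made here for illustration and are not taken from this application.

```latex
\[
\mathrm{PMI}(N1, N2) = \log_{2} \frac{F(N1, N2)}{F(N1)\, F(N2)}
\]
```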
The method for monitoring network bad data according to claim 5, wherein the sentiment analysis calculation formula is:

Among them, Q is a word in the first bad vocabulary list, Cwords is the pre-built civilized vocabulary library, Iwords is the pre-built uncivilized vocabulary library, PMI(Q, cword) is the co-occurrence frequency between the word Q and a word vector in the pre-built civilized vocabulary library, PMI(Q, Iword) is the co-occurrence frequency between the word Q and a word vector in the pre-built uncivilized vocabulary library, and SO-PMI(Q) is the emotional tendency intensity value of the word Q in the first bad vocabulary list.
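
The formula itself is not reproduced in this text. The usual SO-PMI form that matches the symbols defined above sums the co-occurrence values against the civilized library and subtracts the sum against the uncivilized library. It is given here as an assumption rather than as the claimed formula.

```latex
\[
\text{SO-PMI}(Q) =
\sum_{cword \,\in\, Cwords} \mathrm{PMI}(Q, cword)
\;-\;
\sum_{Iword \,\in\, Iwords} \mathrm{PMI}(Q, Iword)
\]
```

Under this form, a word that co-occurs more strongly with uncivilized vocabulary than with civilized vocabulary receives a negative value.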
The method for monitoring network bad data according to claim 5, wherein the emotional tendency intensity threshold rule is:

If the emotional tendency intensity value of the words in the first bad vocabulary list is greater than or equal to zero, the words are words that do not meet the preset bad word emotional tendency rules;

If the emotional tendency intensity value of a word in the first bad vocabulary list is less than zero, the word is a word that satisfies the preset bad word emotional tendency rule.
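
The threshold rule can be sketched as a filter over the first bad vocabulary list. This is a minimal illustration under stated assumptions: the function names are hypothetical, the intensity value is assumed to be the civilized sum minus the uncivilized sum, and the pairwise co-occurrence measure is passed in as the callable `pmi` rather than computed from a corpus.

```python
from typing import Callable, List, Sequence


def so_pmi(
    word: str,
    civilized_words: Sequence[str],
    uncivilized_words: Sequence[str],
    pmi: Callable[[str, str], float],
) -> float:
    """Assumed intensity value: civilized sum minus uncivilized sum."""
    civilized_total = sum(pmi(word, c) for c in civilized_words)
    uncivilized_total = sum(pmi(word, u) for u in uncivilized_words)
    return civilized_total - uncivilized_total


def second_bad_vocabulary_list(
    first_list: Sequence[str],
    civilized_words: Sequence[str],
    uncivilized_words: Sequence[str],
    pmi: Callable[[str, str], float],
) -> List[str]:
    """Keep words with intensity < 0; words with intensity >= 0 are removed."""
    return [
        w
        for w in first_list
        if so_pmi(w, civilized_words, uncivilized_words, pmi) < 0
    ]
```

Note that a word with an intensity value of exactly zero is removed, because the rule treats values greater than or equal to zero as not satisfying the bad word emotional tendency rule.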
The method for monitoring network bad data according to claim 1, wherein the step of screening words that do not conform to the position structure of bad vocabulary sentences from the second bad vocabulary through the word position structure method comprises:

Comparing the words in the second bad vocabulary list with the sentence position structure where the bad words in the pre-built bad vocabulary sentence template are located;

From the second bad vocabulary list, words that do not conform to the sentence position structure of the bad vocabulary in the bad vocabulary sentence template are filtered out to obtain a third bad vocabulary list.
The method for monitoring network bad data according to claim 1, wherein the step of comparing the words in the word segmentation set with a preset bad vocabulary comparison table, identifying the bad words in the word segmentation set, loading the bad words into the first bad vocabulary list, and taking the words remaining in the word segmentation set after screening as candidate words comprises:

Input the words in the word segmentation set and the preset bad vocabulary comparison table into a preset same word screening model, and filter bad words from the word segmentation set through the preset same word screening model;

The bad words are loaded into the first bad vocabulary list, and the remaining words after screening in the word segmentation set are used as candidate words.
The method for monitoring network bad data according to claim 10, wherein the preset same word screening model comprises:

A first input layer for inputting the words in the word segmentation set; a second input layer for inputting the preset bad vocabulary comparison table; a same word screening layer for comparing the words input by the first input layer against the preset bad vocabulary comparison table input by the second input layer; a first output layer for outputting the bad words identified in the word segmentation set by the same word screening layer; and a second output layer for outputting the words remaining in the word segmentation set after the same word screening layer has removed the bad words.
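
The two-input, two-output shape of this model can be illustrated with a minimal sketch. The function name `screen_same_words` is hypothetical, and exact string matching is assumed as the comparison performed by the same word screening layer; the text does not state how the comparison is carried out.

```python
from typing import Iterable, List, Sequence, Tuple


def screen_same_words(
    segmented_words: Sequence[str],   # first input: the word segmentation set
    bad_table: Iterable[str],         # second input: the comparison table
) -> Tuple[List[str], List[str]]:
    """Return (bad words, remaining candidate words), preserving order."""
    bad_set = set(bad_table)  # exact-match lookup, an assumption of this sketch
    # First output: words that also appear in the comparison table.
    bad_words = [w for w in segmented_words if w in bad_set]
    # Second output: words left over after the bad words are removed.
    remaining = [w for w in segmented_words if w not in bad_set]
    return bad_words, remaining
```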
An electronic device, wherein the electronic device includes a memory and a processor, and a network bad data monitoring program is stored in the memory, and the following steps are implemented when the network bad data monitoring program is executed by the processor:

Performing word segmentation processing on a target text to obtain a word segmentation set;

Comparing the words in the word segmentation set with a preset bad vocabulary comparison table, identifying the bad words in the word segmentation set, loading the bad words into a first bad vocabulary list, and taking the words remaining in the word segmentation set after screening as candidate words;

Calculating, through a word similarity calculation formula, the average similarity between each candidate word and the words in the preset bad vocabulary comparison table, and loading the candidate words whose average similarity is greater than a preset similarity threshold into the first bad vocabulary list;

Removing, through a sentiment analysis algorithm, the words that do not satisfy a preset bad word emotional tendency rule from the first bad vocabulary list to obtain a second bad vocabulary list;

Removing, through a word position structure method, the words that do not conform to the bad vocabulary sentence position structure from the second bad vocabulary list, and obtaining and outputting a third bad vocabulary list.
The electronic device according to claim 12, wherein the step of calculating the mean similarity between each candidate word and the words in the preset bad vocabulary comparison table through a word similarity calculation formula comprises:

Performing vectorization processing on each candidate word to obtain a word vector of the candidate word;

Calculating, through the word similarity calculation formula, the similarity between the word vector of each candidate word and each bad word vector in a preset bad word vector set to obtain N similarity values, where the preset bad word vector set is the word vector set obtained by vectorizing the words in the preset bad vocabulary comparison table;

Obtaining, according to the N similarity values, the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table.
The electronic device according to claim 13, wherein said obtaining, according to the N similarity values, the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table comprises:

Summing the N similarity values to obtain a total similarity value, where N is the number of words in the preset bad vocabulary comparison table;

Dividing the total similarity value by N to obtain the mean similarity between the candidate word and the words in the preset bad vocabulary comparison table.
The electronic device according to claim 12, wherein the word similarity calculation formula is:

Among them, W1 is the word vector of the candidate word, W2 is any word vector in the preset bad word vector set, n is the word vector dimension, W1_i is the value of W1 in the i-th dimension, and W2_i is the value of W2 in the i-th dimension.
The electronic device according to claim 12, wherein the step of removing, through a sentiment analysis algorithm, the words that do not satisfy the preset bad word emotional tendency rule from the first bad vocabulary list to obtain the second bad vocabulary list comprises:

Performing vectorization processing on the words in the first bad vocabulary list to obtain word vectors to be calculated;

Calculating, through a word co-occurrence frequency calculation formula, the co-occurrence frequency between each word vector to be calculated and the word vectors in a pre-built civilized vocabulary library, and the co-occurrence frequency between each word vector to be calculated and the word vectors in a pre-built uncivilized vocabulary library, and taking these as the word co-occurrence frequencies to be used;

Calculating, according to the word co-occurrence frequencies to be used, the emotional tendency intensity value of each word in the first bad vocabulary list through a sentiment analysis calculation formula;

Comparing the emotional tendency intensity value of each word in the first bad vocabulary list against a preset emotional tendency intensity threshold rule, and removing from the first bad vocabulary list, according to that rule, the words that do not satisfy the preset bad word emotional tendency rule, to obtain the second bad vocabulary list.
The electronic device according to claim 16, wherein the formula for calculating the word co-occurrence frequency is:

Among them, F(N1, N2) refers to the frequency at which N1 and N2 appear together within a window of a set size across all n articles, and F(N1) and F(N2) refer to the frequencies at which N1 and N2, respectively, appear across all n articles.
The electronic device according to claim 16, wherein the emotion analysis calculation formula is:

Among them, Q is a word in the first bad vocabulary list, Cwords is the pre-built civilized vocabulary library, Iwords is the pre-built uncivilized vocabulary library, PMI(Q, cword) is the co-occurrence frequency between the word Q and a word vector in the pre-built civilized vocabulary library, PMI(Q, Iword) is the co-occurrence frequency between the word Q and a word vector in the pre-built uncivilized vocabulary library, and SO-PMI(Q) is the emotional tendency intensity value of the word Q in the first bad vocabulary list.
A monitoring system for network bad data, which includes:

The word segmentation processing unit is used to perform word segmentation processing on the target text to obtain a word segmentation set;

The bad word screening unit is used to compare the words in the word segmentation set with a preset bad vocabulary comparison table, identify the bad words in the word segmentation set, load the bad words into a first bad vocabulary list, and take the words remaining in the word segmentation set after screening as candidate words;

The word similarity calculation unit is used to calculate, through a word similarity calculation formula, the average similarity between each candidate word and the words in the preset bad vocabulary comparison table, and to load the candidate words whose average similarity is greater than a preset similarity threshold into the first bad vocabulary list;

The sentiment analysis unit is used to remove, through a sentiment analysis algorithm, the words that do not satisfy a preset bad word emotional tendency rule from the first bad vocabulary list to obtain a second bad vocabulary list;

The word position structure screening unit is used to remove, through a word position structure method, the words that do not conform to the bad vocabulary sentence position structure from the second bad vocabulary list, to obtain and output a third bad vocabulary list.
A computer-readable storage medium, wherein a network bad data monitoring program is stored in the computer-readable storage medium, and when the network bad data monitoring program is executed by a processor, it implements the steps of the method for monitoring network bad data according to any one of claims 1 to 8.