CN110929506A - Junk information detection method, device and equipment and readable storage medium - Google Patents

Junk information detection method, device and equipment and readable storage medium

Info

Publication number
CN110929506A
Authority
CN
China
Prior art keywords
training; LSTM model; detected; detection method; generated content
Prior art date
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Application number
CN201911227487.6A
Other languages
Chinese (zh)
Inventor
范如 (Fan Ru)
范渊 (Fan Yuan)
Current Assignee (the listed assignees may be inaccurate)
Hangzhou Dbappsecurity Technology Co Ltd
Original Assignee
Hangzhou Dbappsecurity Technology Co Ltd
Application filed by Hangzhou Dbappsecurity Technology Co Ltd
Priority to CN201911227487.6A
Publication of CN110929506A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Abstract

The application discloses a spam detection method, a spam detection device, spam detection equipment and a readable storage medium. The disclosed method is applied to a server and comprises the following steps: acquiring user-generated content from a client; performing normalization processing and word-segmentation processing on the user-generated content to obtain the words to be detected; mapping the words to be detected into vectors to be detected by using a Skip-Gram model; and detecting the vectors to be detected by using an LSTM model to obtain a detection result. The Skip-Gram model can accurately express semantic information, and the LSTM model has a gradient-transfer characteristic, so a single word can be analyzed while the relevance information among different words is also considered, which improves the accuracy of the detection result. Accordingly, the spam detection device, the spam detection equipment and the readable storage medium achieve the same technical effects.

Description

Junk information detection method, device and equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a readable storage medium for detecting spam.
Background
Detection of user-generated content mostly relies on regular expressions to match keywords, but the accuracy of this approach depends on the quality of the matching strategy, so the accuracy of the detection result cannot be guaranteed. User-generated content is content that the user edits and outputs at the client.
At present, NBOW (Neural Bag-of-Words) and CNN (Convolutional Neural Network) models may also be used to detect spam in user-generated content. However, NBOW and CNN only extract and analyze local information of the text, which causes a loss of textual information, so the accuracy of the detection result is not high.
Therefore, how to improve the detection accuracy of the spam information is a problem to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, an object of the present application is to provide a method, an apparatus, a device and a readable storage medium for detecting spam, so as to improve the accuracy of detecting spam. The specific scheme is as follows:
in a first aspect, the present application provides a spam detection method, applied to a server, including:
acquiring user generated content of a client;
respectively carrying out normalization processing and word segmentation processing on the user generated content to obtain to-be-detected words;
mapping the word segmentation to be detected into a vector to be detected by using a Skip-Gram model;
and detecting the vector to be detected by using an LSTM model to obtain a detection result.
Preferably, obtaining the user-generated content of the client comprises:
and acquiring the user generated content from the client according to the input rule of the LSTM model.
Preferably, the training process of the LSTM model includes:
acquiring a training word vector;
randomly selecting a plurality of training samples with preset sizes from the training word vectors;
processing each training sample by using the current LSTM model to obtain a plurality of training results;
calculating the error of the current training result and the real detection result of the training sample corresponding to the current training result aiming at any training result to obtain a plurality of errors;
determining the average value of the plurality of errors as the error value of the current LSTM model;
and if the error value is smaller than the preset error threshold value, determining the current LSTM model as the LSTM model.
Preferably, the method further comprises the following steps:
and if the error value is not smaller than the preset error threshold, the step of randomly selecting a plurality of training samples with a preset size from the training word vectors is executed again.
Preferably, the method further comprises the following steps:
and calculating the accuracy of the detection result, and if the accuracy is lower than a preset accuracy threshold, performing a hot update of the LSTM model.
Preferably, the LSTM model comprises: an input layer, a hidden layer, and an output layer, each added with a Dropout operation.
Preferably, the server and client communicate through the gRPC.
In a second aspect, the present application provides a spam detection apparatus, applied to a server, including:
the acquisition module is used for acquiring user generated content of the client;
the preprocessing module is used for respectively carrying out normalization processing and word segmentation processing on the user generated content to obtain to-be-detected words;
the vectorization module is used for mapping the participles to be detected into vectors to be detected by using a Skip-Gram model;
and the detection module is used for detecting the vector to be detected by using the LSTM model to obtain a detection result.
In a third aspect, the present application provides a spam detection apparatus, including:
a memory for storing a computer program;
and the processor is used for executing the computer program to implement the spam detection method disclosed in the foregoing.
In a fourth aspect, the present application provides a readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the spam detection method disclosed in the foregoing.
According to the scheme, the application provides a junk information detection method, which is applied to a server and comprises the following steps: acquiring user generated content of a client; respectively carrying out normalization processing and word segmentation processing on the user generated content to obtain to-be-detected words; mapping the word segmentation to be detected into a vector to be detected by using a Skip-Gram model; and detecting the vector to be detected by using an LSTM model to obtain a detection result.
According to the method, after the user-generated content is obtained, normalization processing and word-segmentation processing are first performed on it to obtain the words to be detected; the words to be detected are then mapped into vectors to be detected with a Skip-Gram model; finally, the vectors to be detected are detected with an LSTM model to obtain a detection result. The Skip-Gram model can accurately express semantic information, and the LSTM model has a gradient-transfer characteristic: the output of any neuron in the LSTM model can be fed into other neurons, so the processing of current data is related to historical data. The LSTM model can therefore analyze an individual word while also considering the relevance information among different words, which improves the accuracy of the detection result.
Correspondingly, the junk information detection device, the junk information detection equipment and the readable storage medium have the technical effects.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a flowchart of a first spam detection method disclosed in the present application;
FIG. 2 is a flow chart of the training of an LSTM model disclosed herein;
FIG. 3 is a flow chart of word vector training disclosed herein;
FIG. 4 is a flow chart illustrating the training of another LSTM model disclosed herein;
FIG. 5 is a schematic view of an anti-spam system according to the present disclosure;
FIG. 6 is a schematic diagram of a spam detection apparatus according to the present disclosure;
fig. 7 is a schematic diagram of a spam detection device disclosed in the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
At present, NBOW and CNN only extract and analyze local information of a text, which results in text information loss, and thus the accuracy of a detection result is not high. Therefore, the application provides a junk information detection scheme, and the detection accuracy of the junk information can be improved.
Referring to fig. 1, an embodiment of the present application discloses a spam detection method, which is applied to a server and includes:
s101, obtaining user generated content of the client.
It should be noted that spam may exist in the user-generated content that the user edits at the client, such as: advertisements, fraud information, pornographic or violent content, lottery promotions, etc. In order to sanitize the network environment, it is necessary to detect spam in user-generated content so that it can be deleted.
In one embodiment, obtaining the user-generated content of the client comprises: acquiring the user-generated content from the client according to the input rule of an LSTM (Long Short-Term Memory) model. The type of data content the LSTM model handles may be preset; for example, if the LSTM model has a higher detection accuracy for advertising-type spam, the input rule can be set to advertising content, so that mostly advertising content is acquired from the client.
And S102, respectively carrying out normalization processing and word segmentation processing on the user generated content to obtain the word to be detected.
Because the text of user-generated content contains various words, redundant punctuation and meaningless stop words, the text can first be normalized. Normalization specifically comprises: Chinese-English conversion, traditional-to-simplified Chinese conversion, punctuation conversion (full-width to half-width), special-character conversion and the like, so that the text of the user-generated content becomes more standard. After the text is normalized, obfuscated writing such as "Martian script" (火星文) and similar disguised characters hidden in the user-generated content can also be detected.
Word segmentation is then performed on the normalized text; a common word-segmentation tool can be used. The word set obtained after segmentation serves as the words to be detected. To avoid losing key information, the whole word set obtained after segmentation can be used as the words to be detected. Alternatively, to improve detection efficiency, screening conditions may be set, such as: take the words whose frequency in the word set is not less than 3 as the words to be detected, and uniformly replace the words whose frequency is less than 3 with a target token (such as UNK). In this way even words that appear only once or twice are still represented, which improves the detection accuracy.
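To make the preprocessing concrete, here is a minimal pure-Python sketch of the two steps just described: normalization (illustrated here only by full-width-to-half-width folding via Unicode NFKC, one of the conversions listed above) and low-frequency word replacement with UNK. The function names and the NFKC shortcut are illustrative assumptions, not the patent's implementation.

```python
import unicodedata
from collections import Counter

def normalize(text: str) -> str:
    # NFKC folds full-width letters and punctuation to half-width,
    # approximating the "full-width to half-width" conversion above.
    return unicodedata.normalize("NFKC", text).lower()

def filter_tokens(tokens, min_freq=3, unk="UNK"):
    # Keep words whose frequency is at least min_freq; replace the
    # rest with a single target token so rare words still contribute.
    freq = Counter(tokens)
    return [t if freq[t] >= min_freq else unk for t in tokens]
```

For example, `normalize("Ｈｅｌｌｏ，ＷＯＲＬＤ")` folds the full-width characters down to `"hello,world"`.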
S103, mapping the participles to be detected into vectors to be detected by using a Skip-Gram model.
And S104, detecting the vector to be detected by using an LSTM model to obtain a detection result.
In this embodiment, the method further includes: calculating the accuracy of the detection result, and if the accuracy is lower than a preset accuracy threshold, performing a hot update of the LSTM model, such as: replacing the currently running model with the latest version of the model, or rolling back to the previous version of the model.
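The hot-update decision can be pictured with a small illustrative sketch. The class name and the rollback-to-previous-version rule are assumptions (in the deployment embodiment this job is delegated to TensorFlow Serving); only the accuracy-threshold trigger mirrors the text:

```python
class ModelRegistry:
    """Illustrative hot-update policy: keep model versions and swap the
    serving model when measured accuracy drops below a threshold."""

    def __init__(self, versions, threshold=0.9):
        self.versions = list(versions)    # e.g. ["v1", "v2"]
        self.current = self.versions[-1]  # serve the latest version
        self.threshold = threshold

    def report_accuracy(self, accuracy: float) -> str:
        # If accuracy falls below the threshold, roll back to the
        # previous version (one possible hot-update policy).
        if accuracy < self.threshold:
            idx = self.versions.index(self.current)
            if idx > 0:
                self.current = self.versions[idx - 1]
        return self.current
```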
Specifically, the LSTM model includes an input layer, a hidden layer and an output layer, each with a Dropout operation added. Since the output of any neuron in the LSTM model may be fed into other neurons, heavy interaction between different neurons reduces the operating efficiency of the whole model. To control this interaction, the Dropout operation can be used to determine the weight of the interaction between different neurons in the same layer. For example: the hidden layer has 30 neurons, the probability that the output of the 5th neuron is processed by the 10th neuron is 0.8, the probability that the output of the 5th neuron is processed by the 11th neuron is 0.7, and so on.
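For reference, the conventional (inverted) Dropout operation looks like the sketch below. Note that standard Dropout keeps each activation with a single probability and rescales it, whereas the per-neuron-pair probabilities described above are the patent's own formulation:

```python
import random

def dropout(vector, keep_prob=0.8, rng=None):
    # Inverted dropout: each activation is kept with probability
    # keep_prob and rescaled by 1/keep_prob, otherwise zeroed.
    rng = rng or random.Random()
    return [v / keep_prob if rng.random() < keep_prob else 0.0
            for v in vector]
```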
In this embodiment, the server and the client communicate through gRPC (Google Remote Procedure Call). The transport protocol used by gRPC is HTTP/2; the HPACK header-compression algorithm and the binary message framing used in HTTP/2 give better transmission performance between the client and the server.
Referring to fig. 2, fig. 2 is a flow chart of the training of the LSTM model. The training process of the LSTM model comprises the following steps:
s201, obtaining a training word vector;
s202, randomly selecting a plurality of training samples with preset sizes from the training word vectors;
s203, processing each training sample by using the current LSTM model to obtain a plurality of training results;
s204, aiming at any training result, calculating the error of the current training result and the real detection result of the training sample corresponding to the current training result to obtain a plurality of errors;
s205, determining the average value of a plurality of errors as the error value of the current LSTM model;
s206, judging whether the error value is smaller than a preset error threshold value or not; if yes, executing S207; if not, executing S202;
and S207, determining the current LSTM model as the LSTM model.
Specifically, the training word vectors comprise a plurality of text sequences. The preset size may be 128, and 100 batches of 128 text sequences each may be randomly selected, so as to obtain 100 training samples. The first training sample is processed with the initial LSTM model to obtain a training result, which corresponds to one error. The second training sample is then processed with the current LSTM model to obtain a second training result, which corresponds to another error; performing this process 100 times yields 100 errors, and the average of these 100 errors is the error value of the LSTM model when the 100th pass is completed. If the error value is smaller than the preset error threshold, the current LSTM model meets the preset requirement and is determined as the final LSTM model; otherwise, training of the current LSTM model continues according to the above process until it reaches the preset requirement.
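The batch-wise loop of steps S201-S207 can be sketched as follows, with a trivial one-parameter stand-in in place of the LSTM. All names and the update rule are illustrative assumptions; only the random batch sampling, error averaging, and stopping threshold mirror the text:

```python
import random

def train_until_converged(samples, labels, batch_size=128, n_batches=100,
                          error_threshold=0.05, lr=0.5, max_rounds=1000,
                          seed=0):
    """Skeleton of the training loop: repeatedly draw random batches,
    average the per-batch errors, stop once the mean error is below
    the preset threshold."""
    rng = random.Random(seed)
    weight = 0.0  # stand-in "model" parameter instead of an LSTM
    data = list(zip(samples, labels))
    for _ in range(max_rounds):
        errors = []
        for _ in range(n_batches):
            batch = [data[rng.randrange(len(data))] for _ in range(batch_size)]
            # per-batch error against the true labels (S204)
            errors.append(sum(abs(weight - y) for _, y in batch) / batch_size)
            # stand-in update step: move the parameter toward the labels
            mean_y = sum(y for _, y in batch) / batch_size
            weight += lr * (mean_y - weight)
        # mean of the batch errors is the model's error value (S205)
        if sum(errors) / len(errors) < error_threshold:
            return weight  # S206/S207: threshold met, keep this model
    return weight
```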
The generation process of the training word vector comprises the following steps: acquiring training data, performing normalization processing and word segmentation processing on the training data, screening words, mapping the screened words into vectors by using a Skip-Gram model, and recording the mapping relation between each vector and each word to obtain a mapping dictionary.
It should be noted that the LSTM model includes an input layer, a hidden layer and an output layer. Because textual information unfolds dynamically, the occurrence of a later word usually depends on the semantics carried by the words before it; the LSTM model can therefore link the context semantics of a sentence and fully consider the semantic features and contextual correlation of words, which makes it suitable for detecting spam in user-generated content.
As can be seen, after the user-generated content is obtained, normalization processing and word-segmentation processing are performed on it to obtain the words to be detected; the words to be detected are then mapped into vectors to be detected with a Skip-Gram model; finally, the vectors to be detected are detected with an LSTM model to obtain a detection result. The Skip-Gram model can accurately express semantic information, and the LSTM model has a gradient-transfer characteristic: the output of any neuron in the LSTM model can be fed into other neurons, so the processing of current data is related to historical data. The LSTM model can therefore analyze an individual word while also considering the relevance information among different words, which improves the accuracy of the detection result.
The following embodiments can be realized in accordance with the methods provided herein.
(1) Referring to fig. 3, the word vector training process specifically includes:
Various types of text (including spam text and non-spam text) are collected, as many as possible. Training the word vectors on large-scale text allows variant content that was never labeled to still be identified. For example, the spam text "add my WeChat (微信) for benefits" has variants that replace "微信" with near-homophones such as "威信" (Wei Xin). After word-vector training, such variant words lie close to "微信" in vector space. That is, if "add my 微信 for benefits" is marked as a negative example, then "add my 威信 for benefits" can also be accurately identified as a negative example.
All texts are cleaned (i.e. normalized): the original text undergoes Chinese-English conversion, traditional-to-simplified Chinese conversion, punctuation conversion (full-width to half-width), special-character conversion and the like, so that the text becomes more standard.
After cleaning, word segmentation is performed on the text. Words with a frequency of not less than 3 are added to the training set, and words with a frequency of less than 3 are uniformly replaced with UNK; in this way even words that appear only once or twice are still represented in the training set, which improves the generalization ability of the model.
After word segmentation, the training set is mapped into a mapping dictionary using the Skip-Gram model of the Word2Vec algorithm: each word in the training set is mapped into a vector, and the mapping relation between each vector and each word is recorded. For each word, the Skip-Gram model considers the influence of the other words in its context.
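The way Skip-Gram "considers the influence of other words in the context of each word" is by generating (center, context) training pairs from a sliding window over the token sequence. A minimal sketch (the window size and function name are illustrative):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) pairs: every word within `window`
    positions of the center word becomes a prediction target."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```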
(2) Referring to fig. 4, the training process of the LSTM model specifically includes:
dividing the obtained training set into training samples and test samples according to the proportion of 7:3, and removing labels in the test samples. And training the LSTM model by using the training sample, and testing the LSTM model by using the test sample after the training is finished, so that the detection capability of the trained LSTM model can be determined. Wherein the LSTM model comprises: an input layer, a hidden layer, and an output layer, each added with a Dropout operation.
Each time, randomly extracting 128 text sequences from the training samples to be used as a batch, and inputting the batch into the LSTM model; calculating the error between the training result of each batch and the real result of the current batch; determining the average value of all errors as the error value of the LSTM model when the current training process is finished, if the error value is smaller than a preset error threshold value, indicating that the current LSTM model meets the preset requirement, and determining the current LSTM model as the LSTM model; otherwise, continuing to train the current LSTM model according to the process until the LSTM model reaches the preset requirement.
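The 7:3 split of the training set described above can be sketched as follows; whether the split is randomized is not stated in the text, so the shuffle (and the function name) is an assumption:

```python
import random

def split_train_test(dataset, ratio=0.7, seed=0):
    """Split a dataset into train/test portions at the given ratio."""
    data = list(dataset)
    random.Random(seed).shuffle(data)  # assumed: randomized split
    cut = int(len(data) * ratio)
    return data[:cut], data[cut:]
```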
(3) Referring to fig. 5, the trained LSTM model is deployed to obtain the anti-spam system.
In fig. 5, the front-end machine is the client and the background is the server. The anti-spam system comprises a client and a server. The server is deployed on TensorFlow Serving, a flexible, high-performance serving system for machine-learning models that can deploy a variety of deep-learning models implemented in TensorFlow. Both gRPC and RESTful API communication are supported between the client and the server.
The anti-spam service in the anti-spam system is divided into an online layer and an offline layer: the online real-time service requires a millisecond-level judgment of whether a text to be detected is spam, while the offline service must update the LSTM model in time according to the currently detected text.
A server deployed on TensorFlow Serving supports hot updates, provides a stable service and a simple interface, and lets the offline training and online detection of the anti-spam system run in parallel. When the anti-spam system must improve the model to recognize new data, it can quickly load the new iteration, and it can automatically pick up the latest version of a model or roll back to the previous version, which makes it very suitable for the high-frequency iteration the anti-spam system requires.
The concrete implementation of the server deployed on TensorFlow Serving includes: 1) deploying the LSTM model, i.e. specifying its inputs (which metadata one prediction contains) and outputs (how the detection result is handled after it is obtained); 2) setting up the gRPC communication between the server and the client.
Therefore, according to this embodiment, the anti-spam system can be deployed. In this system, the server detects the user-generated content produced at the client in real time and identifies spam efficiently and accurately, which on the one hand greatly reduces the investment in manual supervision and review, and on the other hand maintains network security and purifies the network environment.
In the following, a spam detection apparatus provided by an embodiment of the present application is introduced, and a spam detection apparatus described below and a spam detection method described above may be referred to each other.
Referring to fig. 6, an embodiment of the present application discloses a spam detection apparatus, which is applied to a server and includes:
an obtaining module 601, configured to obtain user-generated content of a client;
the preprocessing module 602 is configured to perform normalization processing and word segmentation processing on the user generated content, respectively, to obtain to-be-detected word segments;
the vectorization module 603 is configured to map the to-be-detected participles into to-be-detected vectors by using a Skip-Gram model;
and the detection module 604 is configured to detect the vector to be detected by using the LSTM model to obtain a detection result.
In a specific embodiment, the obtaining module is specifically configured to:
and acquiring the user generated content from the client according to the input rule of the LSTM model.
In one embodiment, the system further comprises a training module, the training module is used for training the LSTM model, and the training module comprises:
the acquisition unit is used for acquiring a training word vector;
the selection unit is used for randomly selecting a plurality of training samples with preset sizes from the training word vectors;
the processing unit is used for processing each training sample by using the current LSTM model to obtain a plurality of training results;
the calculation unit is used for calculating the error of the current training result and the real detection result of the training sample corresponding to the current training result aiming at any training result to obtain a plurality of errors;
a first determining unit, configured to determine an average value of the plurality of errors as an error value of the current LSTM model;
and the second determining unit is used for determining the current LSTM model as the LSTM model if the error value is smaller than the preset error threshold.
In one embodiment, the training module further comprises:
and the execution unit is used for executing the steps of the selection unit again if the error value is not smaller than the preset error threshold.
In a specific embodiment, the method further comprises the following steps:
and the updating module is used for calculating the accuracy of the detection result, and performing a hot update of the LSTM model if the accuracy is lower than a preset accuracy threshold.
In one embodiment, the LSTM model comprises: an input layer, a hidden layer, and an output layer, each added with a Dropout operation.
In one embodiment, the server and client communicate through a gRPC.
For more specific working processes of each module and unit in this embodiment, reference may be made to corresponding contents disclosed in the foregoing embodiments, and details are not described here again.
As can be seen, the embodiment provides a spam detection device. After the user-generated content is obtained, the device performs normalization processing and word-segmentation processing on it to obtain the words to be detected; then maps the words to be detected into vectors to be detected with a Skip-Gram model; and finally detects the vectors to be detected with an LSTM model to obtain a detection result. The Skip-Gram model can accurately express semantic information, and the LSTM model has a gradient-transfer characteristic: the output of any neuron in the LSTM model can be fed into other neurons, so the processing of current data is related to historical data. The LSTM model can therefore analyze an individual word while also considering the relevance information among different words, which improves the accuracy of the detection result.
In the following, a spam detection device provided by an embodiment of the present application is introduced, and a spam detection device described below and a spam detection method and device described above may be referred to each other.
Referring to fig. 7, an embodiment of the present application discloses a spam detection apparatus, including:
a memory 701 for storing a computer program;
a processor 702 for executing the computer program to implement the method disclosed in any of the embodiments above.
In the following, a readable storage medium provided by an embodiment of the present application is introduced, and a readable storage medium described below and a spam detection method, device and apparatus described above may be referred to each other.
A readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the spam detection method disclosed in the foregoing embodiments. For the specific steps of the method, reference may be made to the corresponding contents disclosed in the foregoing embodiments, which are not described herein again.
References in this application to "first," "second," "third," "fourth," etc., if any, are intended to distinguish between similar elements and not necessarily to describe a particular order or sequence. It will be appreciated that the data so used may be interchanged under appropriate circumstances such that the embodiments described herein may be practiced otherwise than as specifically illustrated or described herein. Furthermore, the terms "comprises" and "comprising," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, or apparatus.
It should be noted that the descriptions in this application referring to "first", "second", etc. are for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In addition, technical solutions between various embodiments may be combined with each other, but must be realized by a person skilled in the art, and when the technical solutions are contradictory or cannot be realized, such a combination should not be considered to exist, and is not within the protection scope of the present application.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts among the embodiments may be cross-referenced.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of readable storage medium known in the art.
Specific examples are used herein to explain the principle and implementation of the present application, and the above description of the embodiments is only intended to help understand the method and core idea of the present application. Meanwhile, a person skilled in the art may, according to the idea of the present application, make changes to the specific embodiments and the scope of application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (10)

1. A spam detection method, applied to a server, comprising:
acquiring user-generated content from a client;
performing normalization and word segmentation on the user-generated content to obtain word segments to be detected;
mapping the word segments to be detected into vectors to be detected using a Skip-Gram model; and
detecting the vectors to be detected using an LSTM model to obtain a detection result.
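The pipeline of claim 1 (acquire content, normalize and segment it, embed the segments, classify the vectors) can be sketched as follows. Everything below is an illustrative stand-in: the toy embedding table plays the role of the trained Skip-Gram model and the threshold classifier plays the role of the trained LSTM model; none of the function names or values come from the patent itself.

```python
import re

def normalize(text):
    # Lowercase and collapse whitespace; a stand-in for the
    # normalization step of claim 1 (the patent does not fix the details).
    return re.sub(r"\s+", " ", text.strip().lower())

def segment(text):
    # Whitespace word segmentation; real Chinese user-generated content
    # would need a proper segmenter (a hypothetical choice, not specified here).
    return normalize(text).split(" ")

def to_vectors(tokens, embeddings, dim=4):
    # Map each word segment to its embedding vector; unknown tokens
    # fall back to a zero vector of the same dimension.
    zero = [0.0] * dim
    return [embeddings.get(t, zero) for t in tokens]

def detect(vectors, classify):
    # `classify` stands in for running the trained LSTM model.
    return classify(vectors)

# Toy embedding table and classifier, for illustration only.
embeddings = {"free": [1.0, 0.0, 0.0, 0.0], "money": [0.9, 0.1, 0.0, 0.0]}
is_spam = lambda vecs: sum(v[0] for v in vecs) > 1.0

tokens = segment("FREE   money now")
result = detect(to_vectors(tokens, embeddings), is_spam)
```

A real deployment would replace the embedding table with Skip-Gram vectors trained on a corpus and the classifier with the LSTM model of the later claims.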
2. The spam detection method of claim 1, wherein the acquiring user-generated content from a client comprises:
acquiring the user-generated content from the client according to an input rule of the LSTM model.
3. The spam detection method of claim 2, wherein a training process of the LSTM model comprises:
acquiring training word vectors;
randomly selecting a plurality of training samples of a preset size from the training word vectors;
processing each training sample with the current LSTM model to obtain a plurality of training results;
for each training result, calculating the error between that training result and the true detection result of the corresponding training sample, to obtain a plurality of errors;
determining the average of the plurality of errors as the error value of the current LSTM model; and
if the error value is smaller than a preset error threshold, determining the current LSTM model as the LSTM model.
4. The spam detection method of claim 3, further comprising:
if the error value is not smaller than the preset error threshold, returning to the step of randomly selecting a plurality of training samples of a preset size from the training word vectors.
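Claims 3 and 4 together describe a loop: sample a batch of training samples, compute the per-sample errors, average them into the model's error value, accept the model once that value is below the preset threshold, and otherwise resample and repeat. A minimal sketch of that loop follows; the `model_error` callback is a hypothetical stand-in for running the current LSTM model on a sample and comparing with its true label, and `max_rounds` is an added safety bound not present in the claims.

```python
import random

def train_until_converged(samples, labels, model_error, threshold=0.1,
                          batch_size=4, max_rounds=100, seed=0):
    # Randomly select `batch_size` training samples, average their errors,
    # and stop when the mean error drops below `threshold` (claim 3);
    # otherwise go back and resample (claim 4).
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    error_value = float("inf")
    for round_no in range(max_rounds):
        batch = rng.sample(indices, batch_size)
        errors = [model_error(samples[i], labels[i]) for i in batch]
        error_value = sum(errors) / len(errors)
        if error_value < threshold:
            return round_no, error_value   # current model accepted as the LSTM model
    return max_rounds, error_value         # bound reached without converging

# Toy usage: a perfect "model" converges immediately.
rounds, err = train_until_converged(
    [1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0], lambda x, y: abs(x - y))
```

In practice the error would be a differentiable loss and each round would also update the LSTM weights; the sketch only captures the sample/evaluate/threshold control flow of the claims.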
5. The spam detection method of claim 1, further comprising:
calculating the accuracy of the detection result, and if the accuracy is lower than a preset accuracy threshold, performing a hot update on the LSTM model.
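The accuracy check of claim 5 can be sketched as a monitoring hook: compare recent detection results with their true labels and trigger a hot update (swapping in a retrained model without stopping the service) when accuracy falls below the threshold. The `retrain` callback and the threshold value here are illustrative assumptions, not fixed by the patent.

```python
def maybe_hot_update(predictions, ground_truth, retrain, threshold=0.9):
    # Compute accuracy over a batch of detection results; if it is below
    # the preset threshold, invoke the hot-update hook (hypothetical name)
    # that replaces the serving LSTM model with a retrained one.
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    accuracy = correct / len(predictions)
    if accuracy < threshold:
        retrain()
        return accuracy, True
    return accuracy, False

# Toy usage: two of four predictions are wrong, so an update fires.
accuracy, updated = maybe_hot_update([1, 0, 1, 0], [1, 1, 1, 1], retrain=lambda: None)
```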
6. The spam detection method of claim 1, wherein the LSTM model comprises an input layer, a hidden layer, and an output layer, each with a Dropout operation added.
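The Dropout operation named in claim 6 randomly zeroes units during training to reduce overfitting. A minimal inverted-dropout sketch is shown below; it is a generic illustration of the operation, not the patent's implementation, and a deep-learning framework's built-in layer would be used in practice.

```python
import random

def dropout(values, rate, rng, training=True):
    # Inverted dropout: during training, drop each unit with probability
    # `rate` and rescale survivors by 1/(1-rate) so the expected sum is
    # unchanged; at inference time the operation is the identity.
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]

# Toy usage on an activation vector.
rng = random.Random(0)
activations = dropout([1.0] * 8, rate=0.5, rng=rng)
```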
7. The spam detection method according to any one of claims 1 to 6, wherein the server and the client communicate via gRPC.
8. A spam detection device, applied to a server, comprising:
an acquisition module for acquiring user-generated content from a client;
a preprocessing module for performing normalization and word segmentation on the user-generated content to obtain word segments to be detected;
a vectorization module for mapping the word segments to be detected into vectors to be detected using a Skip-Gram model; and
a detection module for detecting the vectors to be detected using an LSTM model to obtain a detection result.
9. A spam detection device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the spam detection method according to any of claims 1 to 7.
10. A readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the spam detection method according to any one of claims 1 to 7.
CN201911227487.6A 2019-12-04 2019-12-04 Junk information detection method, device and equipment and readable storage medium Pending CN110929506A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911227487.6A CN110929506A (en) 2019-12-04 2019-12-04 Junk information detection method, device and equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911227487.6A CN110929506A (en) 2019-12-04 2019-12-04 Junk information detection method, device and equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN110929506A true CN110929506A (en) 2020-03-27

Family

ID=69856679

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911227487.6A Pending CN110929506A (en) 2019-12-04 2019-12-04 Junk information detection method, device and equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN110929506A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106202330A (en) * 2016-07-01 2016-12-07 北京小米移动软件有限公司 The determination methods of junk information and device
CN108124065A (en) * 2017-12-05 2018-06-05 浙江鹏信信息科技股份有限公司 A kind of method junk call content being identified with disposal
CN108874776A (en) * 2018-06-11 2018-11-23 北京奇艺世纪科技有限公司 A kind of recognition methods of rubbish text and device
CN110175221A (en) * 2019-05-17 2019-08-27 国家计算机网络与信息安全管理中心 Utilize the refuse messages recognition methods of term vector combination machine learning
CN110298041A (en) * 2019-06-24 2019-10-01 北京奇艺世纪科技有限公司 Rubbish text filter method, device, electronic equipment and storage medium


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU DI: "Personalized Online Learning Services Based on the Semantic Web", Wuhan University Press *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528626A (en) * 2020-12-15 2021-03-19 中国联合网络通信集团有限公司 Method, device, equipment and storage medium for detecting malicious language
CN112528626B (en) * 2020-12-15 2023-11-21 中国联合网络通信集团有限公司 Method, device, equipment and storage medium for detecting malicious language
CN112905794A (en) * 2021-02-24 2021-06-04 珠海高凌信息科技股份有限公司 Internet spam detection method and system based on transfer learning
CN112905794B (en) * 2021-02-24 2023-01-06 珠海高凌信息科技股份有限公司 Internet spam detection method and system based on transfer learning

Similar Documents

Publication Publication Date Title
CN108229156A (en) URL attack detection methods, device and electronic equipment
JP2012118977A (en) Method and system for machine-learning based optimization and customization of document similarity calculation
TW202020691A (en) Feature word determination method and device and server
CN107229627B (en) Text processing method and device and computing equipment
CN108027814B (en) Stop word recognition method and device
CN109981625B (en) Log template extraction method based on online hierarchical clustering
CN111144548A (en) Method and device for identifying working condition of pumping well
CN110929506A (en) Junk information detection method, device and equipment and readable storage medium
CN116150651A (en) AI-based depth synthesis detection method and system
CN112667780A (en) Comment information generation method and device, electronic equipment and storage medium
CN113282754A (en) Public opinion detection method, device, equipment and storage medium for news events
CN111062211A (en) Information extraction method and device, electronic equipment and storage medium
CN109670153A (en) A kind of determination method, apparatus, storage medium and the terminal of similar model
CN116402630B (en) Financial risk prediction method and system based on characterization learning
CN115169293A (en) Text steganalysis method, system, device and storage medium
CN113688240A (en) Threat element extraction method, device, equipment and storage medium
CN114610576A (en) Log generation monitoring method and device
CN114065749A (en) Text-oriented Guangdong language recognition model and training and recognition method of system
JP5824429B2 (en) Spam account score calculation apparatus, spam account score calculation method, and program
CN112597498A (en) Webshell detection method, system and device and readable storage medium
CN112183622A (en) Method, device, equipment and medium for detecting cheating in mobile application bots installation
CN115718696B (en) Source code cryptography misuse detection method and device, electronic equipment and storage medium
CN111061924A (en) Phrase extraction method, device, equipment and storage medium
CN114942980B (en) Method and device for determining text matching
CN114416972B (en) DGA domain name detection method based on density improvement unbalance sample

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination