WO2020057413A1 - Junk text identification method and device, computing device and readable storage medium - Google Patents

Junk text identification method and device, computing device and readable storage medium

Info

Publication number
WO2020057413A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
classification model
model
classification
vector
Prior art date
Application number
PCT/CN2019/105348
Other languages
French (fr)
Chinese (zh)
Inventor
高喆
康杨杨
周笑添
孙常龙
刘晓钟
司罗
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2020057413A1 publication Critical patent/WO2020057413A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • The invention relates to the field of artificial intelligence technology, and in particular to a method, device, computing device, and readable storage medium for identifying junk text.
  • Malicious users publish or send junk text containing spam such as pornographic information and uncivilized language on the Internet, which seriously harms the healthy development of the Internet. It is therefore necessary to identify junk text on the Internet so that it can be filtered or deleted.
  • Embodiments of the present invention provide a method, device, computing device, and readable storage medium for identifying junk text, in an effort to solve, or at least alleviate, at least one of the problems above.
  • A method for identifying junk text includes the steps of: performing text division on the text to be recognized to obtain a division result; generating a feature vector for the text to be recognized based on the division result; inputting the feature vector into a plurality of first classification models to obtain the outputs of the plurality of first classification models, where the first classification models include a linear classification model and a deep learning classification model; combining at least the outputs of the plurality of first classification models to obtain a combined vector; and using a second classification model to determine, based on the combined vector, whether the text to be recognized is junk text.
  • The step of generating a feature vector for the text to be recognized based on the division result includes: using at least a bag-of-words model to generate a first vector of the text to be recognized based on the division result; and using a word embedding model to generate a second vector of the text to be recognized based on the division result.
  • The step of inputting the feature vector into the plurality of first classification models includes: inputting the first vector generated based on the division result into the linear classification model; and inputting the second vector generated based on the division result into the deep learning classification model.
  • The text to be recognized may be a message that includes a message signature, and the method then further includes the step of calculating, based on the message signature, the historical probability that the text to be recognized is junk text. Accordingly, combining at least the outputs of the plurality of first classification models to obtain a combined vector includes: combining the historical probability and the outputs of the plurality of first classification models to obtain the combined vector.
  • The step of calculating, based on the message signature, the historical probability that the text to be recognized is junk text includes: obtaining historical messages carrying the same message signature as the text to be recognized; and calculating the ratio of the number of those historical messages determined to be junk text to the total number of those historical messages, which serves as the historical probability.
  • The step of using the second classification model to determine, based on the combined vector, whether the text to be recognized is junk text includes: inputting the combined vector into the second classification model to obtain the output of the second classification model; and determining whether the text to be recognized is junk text according to that output.
  • The second classification model includes an ensemble learning classification model that contains a predetermined number of sub-classification models. Inputting the combined vector into the second classification model to obtain its output includes: inputting the combined vector into each sub-classification model of the ensemble learning classification model separately, and determining the output of the second classification model from the outputs of the sub-classification models by a voting mechanism.
  • The first classification models are trained using a first training set with feature vectors as input, and the second classification model is trained using a second training set with combined vectors as input. The first training set and the second training set are both sampled from a full training set, which includes multiple labeled samples whose labels indicate whether each sample is junk text.
  • Each sub-classification model of the ensemble learning model is trained using a sub-training set corresponding to it, and each sub-training set is sampled uniformly, with replacement, from the second training set.
  • The linear classification model is trained with L1 regularization, and the deep learning classification model is trained with a dropout mechanism.
  • The linear classification model includes a logistic regression model and/or a support vector machine model, and the deep learning classification model includes a convolutional neural network model and/or a recurrent neural network model.
  • The ensemble learning classification model includes a random forest model or a gradient boosting decision tree model.
  • The division result may include multiple division results. For each division result, a feature vector is generated for the text to be recognized based on that division result, and the feature vector is input into the plurality of first classification models corresponding to that division result to obtain their outputs; the first classification models corresponding to a division result are trained with feature vectors generated from that division result as input. The step of combining at least the outputs of the plurality of first classification models to obtain a combined vector then includes: combining at least the outputs of the first classification models corresponding to each division result to obtain the combined vector.
  • Performing text division on the text to be recognized to obtain multiple division results includes: dividing the text to be recognized with a word segmentation algorithm to obtain a division result containing multiple word segments; and dividing the text to be recognized with an n-gram language model to obtain division results containing multiple n-grams.
  • Dividing the text to be recognized with an n-gram language model to obtain division results containing multiple n-grams includes: dividing the text to be recognized with a bigram language model to obtain a division result containing multiple bigrams; and dividing the text to be recognized with a trigram language model to obtain a division result containing multiple trigrams.
  • A device for identifying junk text includes: a text division unit adapted to perform text division on the text to be recognized to obtain a division result; a feature learning unit adapted to generate a feature vector for the text to be recognized based on the division result; a first classification unit adapted to input the feature vector into a plurality of first classification models to obtain their outputs, where the first classification models include a linear classification model and a deep learning classification model; a feature combination unit adapted to combine at least the outputs of the plurality of first classification models to obtain a combined vector; and a second classification unit adapted to use a second classification model to determine, based on the combined vector, whether the text to be recognized is junk text.
  • A computing device includes: one or more processors; a memory; and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods for identifying junk text according to embodiments of the present invention.
  • A readable storage medium stores a program including instructions that, when executed by a computing device, cause the computing device to perform any of the methods for identifying junk text according to embodiments of the present invention.
  • The method for identifying junk text according to embodiments of the present invention is based on a stacking algorithm: a plurality of first classification models are integrated through a second classification model to obtain the classification result. Combining the strengths of multiple types of first classification models greatly improves the ability to recognize junk text and yields better model performance.
  • An ensemble learning classification model based on the bagging algorithm is used as the second classification model, drawing on the strengths of the multiple sub-classification models it contains to further improve the recognition of junk text while preventing the model from overfitting.
  • Using multiple division results compensates for the error that a single division result would pass to the classification models and describes the classification-relevant characteristics of the text to be recognized at multiple granularities, further improving the ability to recognize junk text.
  • The L1 regularization term of the linear classification model ensures feature sparsity, thereby ensuring better junk text recognition, and the dropout mechanism of the deep learning classification model further prevents the model from overfitting.
  • FIG. 1 shows a schematic diagram of a junk text recognition system 100 according to an embodiment of the present invention
  • FIG. 2 shows a structural block diagram of a device 200 for identifying junk text according to an embodiment of the present invention;
  • FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the present invention;
  • FIG. 4 shows a flowchart of a method 400 for identifying junk text according to an embodiment of the present invention.
  • FIG. 5 shows a structural block diagram of a junk text recognition device 500 according to an embodiment of the present invention.
  • FIG. 1 shows a schematic diagram of a junk text recognition system 100 according to an embodiment of the present invention.
  • The junk text recognition system 100 includes a recognition front end 110 and a junk text recognition device 200.
  • Spam text here refers to abnormal text, or text that includes spam.
  • A typical piece of junk text is text that includes pornographic information.
  • The recognition front end 110 is any requester that needs to determine whether a given text is junk text.
  • The recognition front end 110 may be part of an instant messaging system.
  • The instant messaging system receives messages entered by users. If a message is junk text, the instant messaging system needs to intercept it; if it is not, the instant messaging system can deliver it. The recognition front end 110 therefore sends the message to the junk text recognition device 200 to determine whether the message is junk text.
  • The recognition front end 110 may also be part of a comment review system.
  • The comment review system receives comments entered by users. If a comment is junk text, the comment review system needs to block it; if it is not, the comment review system can publish it. In this case, the recognition front end 110 may send a junk text recognition request containing the comment to the junk text recognition device 200 for processing.
  • the junk text recognition device 200 receives the request, obtains the text to be recognized from the request, and determines whether the text to be recognized is junk text.
  • FIG. 2 shows a structural block diagram of a device 200 for identifying junk text according to an embodiment of the present invention.
  • the junk text recognition device 200 includes a text division unit 210, a feature learning unit 220, a first classification unit 230, a feature combination unit 240, and a second classification unit 250.
  • the junk text recognition system 100 further includes a full training set 130.
  • the first classification model and the second classification model may be trained using the samples in the full training set 130.
  • the full training set 130 includes a plurality of labeled samples indicating whether the samples are junk text. Samples usually include positive and negative samples. A label of a positive sample indicates that the sample is junk text, and a label of a negative sample indicates that the sample is not junk text. It is necessary to include a predetermined proportion of positive samples and negative samples in the full training set 130, so that the first classification model and the second classification model can be more comprehensively trained.
  • the first classification model is obtained by training using the first training set and using the above feature vector as an input.
  • the second classification model is obtained by using the second training set and the above combination vector as input training, and the first training set and the second training set are both sampled from the full training set.
  • Preferably, the first training set is not identical to the second training set; that is, the two sets may intersect but should not coincide. Of course, the first training set and the second training set may also be disjoint.
  • To ensure feature sparsity, the linear classification model can be trained with L1 regularization; that is, the loss function of the linear classification model can include an L1 regularization term (i.e., the L1 norm).
  • To prevent overfitting, the deep learning classification model can be trained with a dropout mechanism; that is, during training, the deep learning classification model can discard some neurons according to a predetermined dropout ratio.
  • The second classification model may include an ensemble learning classification model, specifically one based on bootstrap aggregating (the bagging algorithm).
  • the ensemble learning classification model includes a predetermined number of sub-classification models.
  • A predetermined number of equally sized sub-training sets can be sampled uniformly, with replacement, from the second training set (that is, using bootstrap sampling) to train the predetermined number of sub-classification models.
  • the predetermined number of sub-training sets correspond one-to-one with the predetermined number of sub-classification models. That is, the sub-classification model is trained using a sub-training set corresponding to the sub-classification model.
  • The second classification unit 250 may input the combined vector into each sub-classification model of the ensemble learning classification model separately, and determine the output of the second classification model from the outputs of the sub-classification models by a voting mechanism or by averaging. It then determines whether the text to be recognized is junk text according to the output of the second classification model.
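  • By way of a hedged illustration (the function names and toy values below are assumptions for illustration, not part of the disclosure), a minimal Python sketch of these two aggregation options might look like this:

```python
def majority_vote(sub_outputs):
    # sub_outputs: binary decisions from each sub-classification model (1 = junk text)
    return int(sum(sub_outputs) > len(sub_outputs) / 2)

def average_score(sub_scores, threshold=0.5):
    # Alternative: average the sub-models' probability-like scores, then threshold
    return int(sum(sub_scores) / len(sub_scores) > threshold)

majority_vote([1, 0, 1])        # -> 1 (junk text)
average_score([0.9, 0.2, 0.7])  # -> 1 (mean 0.6 exceeds 0.5)
```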
  • the division result may include multiple division results.
  • The plurality of first classification models corresponding to a division result may be trained with feature vectors generated from that division result as input. For each division result, the feature learning unit 220 may generate a feature vector for the text to be recognized based on that division result, and the first classification unit 230 may input that feature vector into the plurality of first classification models corresponding to that division result to obtain their outputs. Combining at least the outputs of the plurality of first classification models to obtain a combined vector may therefore include: combining at least the outputs of the first classification models corresponding to each division result to obtain the combined vector.
  • The device 200 for identifying junk text may further include a historical probability calculation unit 260 (not shown in FIG. 2).
  • the historical probability calculation unit 260 is adapted to calculate a historical probability that the text to be identified is junk text based on the message signature. Accordingly, the historical probability and the outputs of all the first classification models can be combined to obtain the above-mentioned combination vector.
  • The message refers to text sent from one party (i.e., the message sender) to another party (i.e., the message receiver), and includes a message signature.
  • the message signature is used to uniquely identify the sender of the message, which can usually be the company name, brand name, project name, or application name.
  • The message signature is usually located at the beginning of the message and is separated from the other content by delimiters such as "[]". For example, in the message "[XX Takeaway] Your takeaway has been delivered.", "XX Takeaway" is the message signature.
  • FIG. 3 shows a schematic diagram of a computing device 300 according to one embodiment of the invention.
  • the computing device 300 typically includes a system memory 306 and one or more processors 304.
  • the memory bus 308 may be used for communication between the processor 304 and the system memory 306.
  • The processor 304 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
  • the processor 304 may include one or more levels of cache, such as a primary cache 310 and a secondary cache 312, a processor core 314, and a register 316.
  • the example processor core 314 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
  • An example memory controller 318 may be used with the processor 304, or in some implementations, the memory controller 318 may be an internal part of the processor 304.
  • The computing device 300 may further include storage devices 332, which may be accessed via a storage interface bus 334 and include removable storage 336 and non-removable storage 338.
  • The computing device 300 may also include an interface bus 340 that facilitates communication from various interface devices (e.g., output devices 342, peripheral interfaces 344, and communication devices 346) to the basic configuration 302 via the bus/interface controller 330.
  • The example output device 342 includes a graphics processing unit 348 and an audio processing unit 350, which can be configured to communicate with various external devices, such as a display or speakers, via one or more A/V ports 352.
  • The example peripheral interface 344 may include a serial interface controller 354 and a parallel interface controller 356, which may be configured to communicate, via one or more I/O ports 358, with input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner).
  • An example communication device 346 may include a network controller 360, which may be arranged to facilitate communication with one or more other computing devices 362 over a network communication link via one or more communication ports 364.
  • a network communication link may be one example of a communication medium.
  • Communication media may typically be embodied as computer-readable instructions, data structures, program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery media.
  • a "modulated data signal" can be a signal in which one or more of its data sets or its changes can be made in a manner that encodes information in the signal.
  • Communication media may include wired media, such as a wired network or direct-wired connection, and various wireless media, such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media.
  • the term computer-readable media as used herein may include both storage media and communication media.
  • The computing device 300 may be implemented as a server, such as a database server, an application server, or a web server, or as a personal computer such as a desktop or notebook computer. Of course, the computing device 300 may also be implemented as part of a small-form-factor portable (or mobile) electronic device.
  • The computing device 300 is implemented as the device 200 for identifying junk text and is configured to perform the method 400 for identifying junk text according to an embodiment of the present invention.
  • the application 322 of the computing device 300 includes multiple program instructions for executing the method 400 for identifying junk text according to an embodiment of the present invention, and the program data 324 may further store configuration information of the device 200 for identifying junk text.
  • FIG. 4 shows a flowchart of a method 400 for identifying junk text according to an embodiment of the present invention. As shown in FIG. 4, the method 400 for identifying junk text starts at step S410.
  • In step S410, text division is performed on the text to be recognized to obtain a division result.
  • Any text division method in the art can be used to perform text division on the text to be recognized.
  • a text segmentation algorithm may be used to perform text segmentation on the text to be recognized, and a segmentation result including multiple segmentations may be obtained.
  • the invention does not limit the specific word segmentation algorithm.
  • For example, applying a word segmentation algorithm to the text "今天天气很好" ("the weather is very good today") yields a division result containing the segments "今天" (today), "天气" (weather), "很" (very), and "好" (good).
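  • As a sketch only (the patent does not prescribe a particular segmenter), an open-source Chinese segmenter such as jieba could produce this division result:

```python
import jieba  # one possible Chinese word segmentation library; an assumption, not mandated here

segments = jieba.lcut("今天天气很好")
# With the default dictionary this typically yields ['今天', '天气', '很', '好']
```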
  • The n-gram language model is a probabilistic language model based on an (n-1)-order Markov chain, which infers the structure of a sentence from the probabilities of occurrence of sequences of n items.
  • An n-gram is a contiguous sequence of n items from a given text; items can be phonemes, syllables, letters, characters, or words. In the embodiments of the present invention, an n-gram is a contiguous sequence of n words from the text to be recognized.
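  • A minimal sketch of this division, assuming character-level items as is common for Chinese text (the helper name is invented for illustration):

```python
def ngrams(text, n):
    # Slide a window of width n over the text to collect contiguous n-grams
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = ngrams("今天天气很好", 2)   # ['今天', '天天', '天气', '气很', '很好']
trigrams = ngrams("今天天气很好", 3)  # ['今天天', '天天气', '天气很', '气很好']
```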
  • In step S420, a feature vector is generated for the text to be recognized based on the division result.
  • the feature vector may include a first vector and a second vector.
  • At least a bag-of-words model can be used to generate the first vector of the text to be recognized based on the division result. Specifically, a bag-of-words model may first be used to generate a bag-of-words vector of the text to be recognized based on the division result.
  • Then, a feature extraction method such as the term frequency-inverse document frequency (TF-IDF) algorithm or a mutual information algorithm is used to process the bag-of-words vector and obtain the first vector.
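  • A minimal sketch of this step, assuming scikit-learn (the patent does not name a library; the documents below are invented examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each document is its division result joined by spaces (word segments or n-grams)
docs = ["今天 天气 很 好", "点击 链接 领取 大奖"]
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
first_vectors = vectorizer.fit_transform(docs)  # sparse TF-IDF "first vectors"
```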
  • In addition, a word embedding model may be used to generate the second vector of the text to be recognized, that is, a word-vector representation, based on the division result.
  • the present invention does not limit the specific word embedding model.
  • For example, a Skip-Gram model or a CBOW (continuous bag-of-words) model can be used.
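  • A minimal sketch of the second-vector generation, assuming gensim 4.x (the library choice and the averaging strategy are assumptions for illustration):

```python
import numpy as np
from gensim.models import Word2Vec  # assumes gensim 4.x

divided = [["今天", "天气", "很", "好"], ["点击", "链接", "领取", "大奖"]]
w2v = Word2Vec(divided, vector_size=100, min_count=1, sg=1)  # sg=1: Skip-Gram; sg=0: CBOW

# One simple way to obtain a text-level second vector: average the word vectors
second_vector = np.mean([w2v.wv[w] for w in divided[0]], axis=0)  # shape (100,)
```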
  • In step S430, the feature vector is input into a plurality of first classification models to obtain the outputs of the plurality of first classification models.
  • The first classification models may include two types of model: linear classification models and deep learning classification models. The first vector generated based on the division result may be input into the linear classification model, and the second vector generated based on the division result may be input into the deep learning classification model.
  • After the outputs of the plurality of first classification models are obtained, in step S440 at least the outputs of the plurality of first classification models are combined to obtain a combined vector.
  • Specifically, at least the outputs of the multiple first classification models can be concatenated to obtain the combined vector.
  • As noted above, the text to be recognized may be a message, and the message includes a message signature. The method 400 for identifying junk text may therefore further include the step of calculating, based on the message signature, the historical probability that the text to be recognized is junk text. Specifically, historical messages carrying the same message signature as the text to be recognized may be obtained (for example, from a preset historical message database), and the ratio of the number of those historical messages determined to be junk text to the total number of retrieved historical messages is calculated as the historical probability that the text to be recognized is junk text.
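  • A minimal sketch of this calculation (the data layout is an assumption for illustration):

```python
def historical_probability(signature, history):
    # history: iterable of (signature, is_junk) pairs from a historical message store
    matched = [is_junk for sig, is_junk in history if sig == signature]
    if not matched:
        return 0.0  # assumption: treat a signature with no history as zero risk
    return sum(matched) / len(matched)

history = [("XX Takeaway", False), ("XX Takeaway", False), ("XX Loans", True)]
historical_probability("XX Takeaway", history)  # -> 0.0
```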
  • In this case, the historical probability that the text to be recognized is junk text can be combined with the outputs of all the first classification models to obtain the combined vector.
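  • Concretely, the combination can be a simple concatenation, as in this sketch (the scores are assumed toy values):

```python
import numpy as np

# Assumed probability-like outputs of the first classification models for one text
lr_out, svm_out, cnn_out = 0.91, 0.87, 0.95
hist_prob = 0.12  # historical probability derived from the message signature

combined_vector = np.array([lr_out, svm_out, cnn_out, hist_prob])  # input to the second model
```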
  • the risk of the text to be identified as junk text can be evaluated from various dimensions.
  • After the combined vector is obtained, a second classification model may be used to determine whether the text to be recognized is junk text: the combined vector is input into the second classification model to obtain its output, and whether the text to be recognized is junk text is then determined according to that output.
  • As mentioned above, the first classification models are trained using the first training set with feature vectors as input, and the second classification model is trained using the second training set with combined vectors as input; the first training set and the second training set are both sampled from the full training set.
  • the full training set includes multiple labeled samples that indicate whether the samples are junk text. Samples usually include positive and negative samples. A label of a positive sample indicates that the sample is junk text, and a label of a negative sample indicates that the sample is not junk text. It is necessary to include a predetermined proportion of positive samples and negative samples in the full training set 130, so that the first classification model and the second classification model can be more comprehensively trained.
  • Preferably, the first training set is not identical to the second training set; that is, the two sets may intersect but should not coincide. Of course, the first training set and the second training set may also be disjoint.
  • The linear classification model can be trained using L1 regularization to ensure feature sparsity; that is, the loss function of the linear classification model may include an L1 regularization term (i.e., the L1 norm).
  • The linear classification model may include a logistic regression model and/or a support vector machine (SVM) model, and the logistic regression model may be trained using L1 regularization.
  • the present invention does not limit the specific linear classification model.
  • Besides the logistic regression model and the support vector machine model, other linear classification models may also be used.
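  • A sketch of such L1-regularized linear models, assuming scikit-learn (the parameter values are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# The L1 penalty drives most feature weights to exactly zero, keeping the features sparse
lr = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
svm = LinearSVC(penalty="l1", dual=False)  # an L1-regularized linear SVM

# lr.fit(first_vectors, labels)  # a smaller C means a stronger L1 term
```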
  • The deep learning classification models can be trained using the dropout mechanism to prevent overfitting; that is, the deep learning classification model can discard some neurons according to a predetermined dropout ratio during training.
  • The deep learning classification model may include a convolutional neural network (CNN) model and/or a recurrent neural network (RNN) model.
  • the invention does not limit the specific deep learning classification model.
  • Besides these, other deep learning classification models may also be used.
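  • A minimal text-CNN sketch with dropout, assuming PyTorch (the architecture details are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=64, dropout_ratio=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3)
        self.dropout = nn.Dropout(dropout_ratio)  # randomly zeroes units during training only
        self.fc = nn.Linear(num_filters, 1)

    def forward(self, token_ids):                       # token_ids: (batch, seq_len), seq_len >= 3
        x = self.embed(token_ids).transpose(1, 2)       # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values  # global max pooling over positions
        return torch.sigmoid(self.fc(self.dropout(x)))  # probability that the text is junk
```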
  • As mentioned above, each sub-classification model of the ensemble learning model is trained using a sub-training set corresponding to it, and each sub-training set is sampled uniformly, with replacement, from the second training set. Specifically, a predetermined number of equally sized sub-training sets can be drawn from the second training set uniformly and with replacement (that is, using bootstrap sampling) to train the predetermined number of sub-classification models; the sub-training sets correspond one-to-one with the sub-classification models.
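  • Bootstrap sampling itself is a one-liner, as in this sketch (assuming NumPy arrays for the second training set; the names are invented):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    # Uniform sampling with replacement; each sub-training set matches the original size
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

rng = np.random.default_rng(0)
# sub_X, sub_y = bootstrap_sample(X2, y2, rng)  # one sub-training set per sub-classification model
```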
  • the invention does not limit the specific ensemble learning model.
  • the ensemble learning classification model may be, for example, a random forest model or a gradient boosted decision tree model (GBDT model), where the sub-classification model is a decision tree.
  • the predetermined number can usually take the value 100.
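  • A sketch of such a second classification model, assuming scikit-learn (the fit/predict calls are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bagged decision trees; prediction aggregates the trees' votes
second_model = RandomForestClassifier(n_estimators=100)
# second_model.fit(combined_vectors, labels)
# second_model.predict(combined_vector.reshape(1, -1))
```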
  • As mentioned above, the division results containing multiple n-grams may include a division result containing multiple bigrams and a division result containing multiple trigrams, obtained by dividing the text to be recognized with a bigram language model and a trigram language model, respectively.
  • For each division result, a feature vector can be generated for the text to be recognized based on that division result, and the feature vector generated from that division result is input into the plurality of first classification models corresponding to it to obtain their outputs. At least the outputs of the first classification models corresponding to each division result are then combined to obtain the combined vector.
  • different division results correspond to different L1 regularization terms.
  • The linear classification model corresponding to the division result containing multiple word segments usually has high precision but low recall, so the L1 regularization term to be added is small.
  • The linear classification model corresponding to the division result containing multiple bigrams usually has low precision but high recall, so the L1 regularization term to be added is large.
  • The linear classification model corresponding to the division result containing multiple trigrams has moderate precision and recall; however, the feature vectors generated from this division result are of very high dimensionality, so a larger L1 regularization term needs to be added.
  • Overall, the L1 regularization term corresponding to the division result containing multiple word segments is the smallest, the L1 regularization term corresponding to the division result containing multiple trigrams is in the middle, and the L1 regularization term corresponding to the division result containing multiple bigrams is the largest.
  • Different division results also correspond to different dropout ratios. Similar to the L1 regularization terms, in the embodiment of the present invention the dropout ratio corresponding to the division result containing multiple word segments is the smallest, the dropout ratio corresponding to the division result containing multiple trigrams is in the middle, and the dropout ratio corresponding to the division result containing multiple bigrams is the largest.
  • FIG. 5 shows a structural block diagram of a junk text recognition device 500 according to an embodiment of the present invention.
  • The junk text recognition device 500 may include a first text division unit 510, a second text division unit 512, a third text division unit 514, a first feature learning unit 520, a second feature learning unit 522, a first basic classification unit 530, a second basic classification unit 532, a third basic classification unit 534, an output combining unit 540, and an integrated classification unit 550.
  • The first text division unit 510 is adapted to perform text division on the text to be recognized based on a word segmentation algorithm to obtain a first division result containing multiple word segments.
  • The second text division unit 512 is adapted to perform text division on the text to be recognized based on a bigram language model to obtain a second division result containing multiple bigrams.
  • The third text division unit 514 is adapted to perform text division on the text to be recognized based on a trigram language model to obtain a third division result containing multiple trigrams.
  • The first feature learning unit 520 is connected to the first text division unit 510, the second text division unit 512, and the third text division unit 514, and is adapted to use at least a bag-of-words model to generate first vectors for the text to be recognized based on the first, second, and third division results, respectively.
  • The second feature learning unit 522 is connected to the first text division unit 510, the second text division unit 512, and the third text division unit 514, and is adapted to use a word embedding model to generate second vectors for the text to be recognized based on the first, second, and third division results, respectively.
  • The first basic classification unit 530 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the first division result into the first logistic regression model and the first support vector machine model corresponding to the first division result, and to input the second vector generated from the first division result into the first convolutional neural network model corresponding to the first division result, so as to obtain the outputs of the first logistic regression model, the first support vector machine model, and the first convolutional neural network model.
  • The second basic classification unit 532 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the second division result into the second logistic regression model and the second support vector machine model corresponding to the second division result, and to input the second vector generated from the second division result into the second convolutional neural network model corresponding to the second division result, so as to obtain the outputs of the second logistic regression model, the second support vector machine model, and the second convolutional neural network model.
  • The third basic classification unit 534 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the third division result into the third logistic regression model and the third support vector machine model corresponding to the third division result, and to input the second vector generated from the third division result into the third convolutional neural network model corresponding to the third division result, so as to obtain the outputs of the third logistic regression model, the third support vector machine model, and the third convolutional neural network model.
  • The output combining unit 540 is connected to the first basic classification unit 530, the second basic classification unit 532, and the third basic classification unit 534, and is adapted to combine the outputs of the first logistic regression model, the first support vector machine model, the first convolutional neural network model, the second logistic regression model, the second support vector machine model, the second convolutional neural network model, the third logistic regression model, the third support vector machine model, and the third convolutional neural network model to obtain the combined vector.
  • The output combining unit 540 may also combine the historical probability that the text to be recognized is junk text with the outputs of the first logistic regression model, the first support vector machine model, the first convolutional neural network model, the second logistic regression model, the second support vector machine model, the second convolutional neural network model, the third logistic regression model, the third support vector machine model, and the third convolutional neural network model to obtain the combined vector.
  • The integrated classification unit 550 is connected to the output combining unit 540 and is adapted to input the combined vector into each decision tree of the random forest model, determine the output of the random forest model from the outputs of the decision trees by a voting mechanism, and finally judge whether the text to be recognized is junk text according to the output of the random forest model.
  • The method for identifying junk text according to the embodiments of the present invention is based on a stacking algorithm: a plurality of first classification models are integrated through a second classification model to obtain the classification result. Combining the strengths of multiple types of first classification models greatly improves the ability to recognize junk text and yields better model performance.
  • Further, the bagging algorithm is combined with the stacking algorithm: an ensemble learning classification model based on bootstrap aggregating (bagging) is used as the second classification model, further improving the recognition of junk text while preventing the model from overfitting.
  • The method also obtains multiple division results to compensate for the error that a single division result would pass to the classification models, and describes the classification-relevant characteristics of the text to be recognized at multiple granularities, further improving the ability to recognize junk text.
  • In addition, the L1 regularization term of the linear classification model ensures feature sparsity, thereby ensuring better junk text recognition, and the dropout mechanism of the deep learning classification model further prevents the model from overfitting.
  • The proportion of messages containing pornographic information relative to normal messages is extremely low, usually no more than about one in ten thousand.
  • Moreover, the variance is large: the types of messages covered are very wide, and the forms of expression of pornographic messages are extremely varied.
  • There are also many variants of pornographic messages that are deliberately vague.
  • Traditional junk text recognition schemes therefore have limited recognition ability. For example, when only the aforementioned first division result, the first vector, and the support vector machine model are used, the F1 value of the model is 0.954; when only the aforementioned second division result, the first vector, and the support vector machine model are used, the F1 value of the model is 0.961. In further comparisons the F1 value rises to 0.971, and to 0.987 for the best-performing configuration.
  • The F1 value is the F-score, the harmonic mean of the model's precision and recall; generally, the larger the F1 value, the better the model's performance.
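  • For reference, a small sketch of the F1 computation:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

f1_score(0.98, 0.96)  # -> 0.9699..., i.e. about 0.97
```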
  • Thus, the model performs better and recognizes junk text more accurately.
  • the various techniques described herein may be implemented in conjunction with hardware or software, or a combination thereof.
  • The method and apparatus of the present invention may take the form of program code (i.e., instructions) embodied in a tangible medium, such as a floppy disk, a CD-ROM, a hard drive, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes a device for practicing the present invention.
  • The computing device generally includes a processor, a processor-readable storage medium (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • the memory is configured to store program code; the processor is configured to execute various methods of the present invention according to instructions in the program code stored in the memory.
  • Computer-readable media includes computer storage media and communication media.
  • the computer storage medium stores information such as computer-readable instructions, data structures, program modules, or other data.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
  • The modules, units, or components of the devices in the examples disclosed herein may be arranged in the device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example.
  • The modules in the foregoing examples may be combined into one module or further divided into multiple sub-modules, and the modules in the devices of the embodiments may be adaptively changed and placed in one or more devices different from those of the embodiments. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed by the present invention is a junk text identification method and device, a computing device and a readable storage medium. One method embodiment comprises the steps of: dividing a text to be recognized so as to obtain a division result; generating a feature vector for the text to be recognized on the basis of the division result; inputting the feature vector into a plurality of first classification models so as to obtain outputs of the plurality of first classification models, the first classification model comprising a linear classification model and a deep learning classification model; at least combining the outputs of the plurality of first classification models, so as to obtain a combined vector; and using a second classification model to determine whether the text to be recognized is junk text according to the combined vector. Also disclosed by the present invention are a corresponding junk text identification device, a computing device and a readable storage medium.

Description

Method, device, computing device and readable storage medium for identifying junk text
This application claims priority to the Chinese patent application No. 201811083369.8, filed on September 17, 2018 and entitled "Method, Device, Computing Device, and Readable Storage Medium for Identifying Junk Text", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of artificial intelligence technology, and in particular to a method, device, computing device, and readable storage medium for identifying junk text.
Background
With the development and popularization of Internet technology, more and more documents and conversations are stored and used on networks in electronic form, and natural language processing technology for handling document and conversation content is increasingly widespread. Within the field of natural language processing, the problem of junk text recognition is receiving growing attention.
Specifically, malicious users publish or send junk text containing spam such as pornographic information and uncivilized language on the Internet, which seriously harms the healthy development of the Internet. It is therefore necessary to identify junk text on the Internet so that it can be filtered or deleted.
In current junk text recognition schemes, approaches that combine traditional feature extraction (e.g., the bag-of-words model) with traditional machine learning classifiers (e.g., support vector machines) have poor recognition performance and weak semantic expressiveness, while approaches that combine word embedding algorithms with deep learning models (e.g., neural networks) require large amounts of training data and use models so complex that they overfit very easily.
Therefore, a more advanced junk text recognition scheme is urgently needed.
Summary of the invention
为此,本发明实施例提供一种垃圾文本的识别方法、装置、计算设备及可读存储介质,以力图解决或者至少缓解上面存在的至少一个问题。To this end, embodiments of the present invention provide a method, a device, a computing device, and a readable storage medium for identifying junk text, in an effort to solve or at least alleviate at least one of the problems above.
根据本发明实施例的一个方面,提供了一种垃圾文本的识别方法,该方法包括步骤:对待识别文本进行文本划分,得到划分结果;基于该划分结果为待识别文本生成特征向量;将该特征向量输入多个第一分类模型,以得到多个第一分类模型的输出,第一分类模型包括线性分类模型和深度学习分类模型;至少对多个第一分类模型的输出进行组合,得到组合向量;以及根据组合向量,采用第二分类模型来判断待识别文本是否为垃圾文 本。According to an aspect of the embodiment of the present invention, a method for identifying junk text is provided. The method includes the steps of: dividing a text to be recognized to obtain a division result; generating a feature vector for the text to be recognized based on the division result; A vector is input to a plurality of first classification models to obtain outputs of the plurality of first classification models. The first classification model includes a linear classification model and a deep learning classification model; at least the outputs of the plurality of first classification models are combined to obtain a combined vector. ; And based on the combined vector, a second classification model is used to determine whether the text to be recognized is junk text.
可选地,在根据本发明实施例的垃圾文本的识别方法中,基于划分结果为待识别文本生成特征向量的步骤包括:至少采用词袋模型,基于该划分结果生成待识别文本的第一向量;采用词嵌入模型,基于该划分结果生成待识别文本的第二向量。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the step of generating a feature vector for the text to be identified based on the division result includes: at least a bag-of-words model, and generating a first vector of the text to be identified based on the division result. ; Using a word embedding model, a second vector of text to be recognized is generated based on the division result.
可选地,在根据本发明实施例的垃圾文本的识别方法中,将该特征向量输入多个第一分类模型的步骤包括:将基于该划分结果生成的第一向量输入线性分类模型;将基于该划分结果生成的第二向量输入深度学习分类模型。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the step of inputting the feature vector into a plurality of first classification models includes: inputting a first vector generated based on the division result into a linear classification model; The second vector generated by the division result is input into a deep learning classification model.
可选地,在根据本发明实施例的垃圾文本的识别方法中,待识别文本为消息,该消息包括消息签名,该方法还包括步骤:基于消息签名,计算待识别文本为垃圾文本的历史概率;相应地,至少对多个第一分类模型的输出进行组合,得到组合向量的步骤包括:对历史概率和多个第一分类模型的输出进行组合,得到组合向量。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the text to be identified is a message including a message signature, and the method further includes the step of calculating a historical probability that the text to be identified is junk text based on the message signature. Correspondingly, at least combining the outputs of the plurality of first classification models to obtain a combination vector includes: combining the historical probability and the outputs of the plurality of first classification models to obtain a combination vector.
可选地,在根据本发明实施例的垃圾文本的识别方法中,基于消息签名,计算待识别文本为垃圾文本的历史概率的步骤包括:获取包括待识别文本的消息签名的历史消息;计算历史消息中确定为垃圾文本的部分历史消息与该历史消息的数量之比,以作为历史概率。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the step of calculating a historical probability that the text to be identified is junk text based on the message signature includes: obtaining a historical message including a message signature including the text to be identified; and calculating the history The ratio of the part of the historical message in the message determined as junk text to the number of the historical message is used as the historical probability.
可选地,在根据本发明实施例的垃圾文本的识别方法中,根据组合向量,采用第二分类模型来判断待识别文本是否为垃圾文本的步骤包括:将组合向量输入第二分类模型,以得到第二分类模型的输出;根据第二分类模型的输出来判断待识别文本是否为垃圾文本。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the step of using the second classification model to determine whether the text to be identified is junk text according to the combination vector includes: entering the combination vector into the second classification model, Get the output of the second classification model; determine whether the text to be recognized is junk text according to the output of the second classification model.
可选地,在根据本发明实施例的垃圾文本的识别方法中,第二分类模型包括集成学习分类模型,集成学习分类模型包括预定数目个子分类模型,将组合向量输入第二分类模型,以得到第二分类模型的输出的步骤包括:将组合向量分别输入集成学习分类模型所包含的每个子分类模型,以便采用投票机制,根据每个子分类模型的输出来确定第二分类模型的输出。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the second classification model includes an integrated learning classification model, and the integrated learning classification model includes a predetermined number of sub-classification models, and a combination vector is input to the second classification model to obtain The output of the second classification model includes: inputting the combination vector into each sub-classification model included in the integrated learning classification model separately, so as to adopt a voting mechanism to determine the output of the second classification model according to the output of each sub-classification model.
可选地,在根据本发明实施例的垃圾文本的识别方法中,第一分类模型是利用第一训练集、以特征向量为输入训练得到,第二分类模型是利用第二训练集、以组合向量为输入训练得到,第一训练集和第二训练集是从全训练集中抽样得到,全训练集包括多个标注有标签的样本,标签指示样本是否为垃圾文本。Optionally, in the method for recognizing junk text according to the embodiment of the present invention, the first classification model is obtained by training using the first training set and using feature vectors as input, and the second classification model is using the second training set by combining The vectors are obtained by input training. The first training set and the second training set are sampled from the full training set. The full training set includes multiple labeled samples, and the labels indicate whether the samples are junk text.
可选地,在根据本发明实施例的垃圾文本的识别方法中,集成学习模型所包含的子分类模型利用与子分类模型相对应的子训练集训练得到,子训练集是从第二训练集中均 匀有放回地抽样得到。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the sub-classification model included in the integrated learning model is trained using a sub-training set corresponding to the sub-classification model, and the sub-training set is obtained from the second training set Sampling evenly replaced.
可选地,在根据本发明实施例的垃圾文本的识别方法中,线性分类模型利用L1正则化来进行训练,深度学习分类模型利用丢弃机制来进行训练。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the linear classification model is trained using L1 regularization, and the deep learning classification model is trained using a discard mechanism.
可选地,在根据本发明实施例的垃圾文本的识别方法中,线性分类模型包括逻辑回归模型和/或支持向量机模型,深度学习分类模型包括卷积神经网络模型和/或循环神经网络模型。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the linear classification model includes a logistic regression model and / or a support vector machine model, and the deep learning classification model includes a convolutional neural network model and / or a recurrent neural network model .
可选地,在根据本发明实施例的垃圾文本的识别方法中,集成学习分类模型包括随机森林模型或梯度提升决策树模型。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the integrated learning classification model includes a random forest model or a gradient boosting decision tree model.
Optionally, in the junk text identification method according to an embodiment of the present invention, the division result includes multiple division results. For each division result, a feature vector is generated for the text to be recognized based on that division result, and the feature vector is input into the plurality of first classification models corresponding to that division result to obtain the outputs of those first classification models, where the first classification models corresponding to a division result are trained with feature vectors generated from that division result as input. The step of combining at least the outputs of the plurality of first classification models to obtain the combination vector then includes: combining at least the outputs of the first classification models corresponding to the respective division results to obtain the combination vector.
Optionally, in the junk text identification method according to an embodiment of the present invention, the step of performing text division on the text to be recognized to obtain multiple division results includes: performing text division on the text to be recognized based on a word segmentation algorithm to obtain a division result including a plurality of segmented words; and performing text division on the text to be recognized based on an n-gram language model to obtain a division result including a plurality of n-grams.
Optionally, in the junk text identification method according to an embodiment of the present invention, the step of performing text division on the text to be recognized based on the n-gram language model to obtain a division result including a plurality of n-grams includes: performing text division on the text to be recognized based on a bigram language model to obtain a division result including a plurality of bigrams; and performing text division on the text to be recognized based on a trigram language model to obtain a division result including a plurality of trigrams.
According to another aspect of the embodiments of the present invention, a junk text identification device is provided, including: a text division unit adapted to perform text division on a text to be recognized to obtain a division result; a feature learning unit adapted to generate a feature vector for the text to be recognized based on the division result; a first classification unit adapted to input the feature vector into a plurality of first classification models to obtain outputs of the plurality of first classification models, the first classification models including a linear classification model and a deep learning classification model; a feature combination unit adapted to combine at least the outputs of the plurality of first classification models to obtain a combination vector; and a second classification unit adapted to use a second classification model to determine, according to the combination vector, whether the text to be recognized is junk text.
According to another aspect of the embodiments of the present invention, a computing device is provided, including: one or more processors; a memory; and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the junk text identification methods according to the embodiments of the present invention.
According to yet another aspect of the embodiments of the present invention, a readable storage medium storing a program is provided, the program including instructions that, when executed by a computing device, cause the computing device to perform any of the junk text identification methods according to the embodiments of the present invention.
The junk text identification method according to the embodiments of the present invention is based on a stacking algorithm: a second classification model integrates a plurality of first classification models to produce the classification result. By combining the strengths of multiple types of first classification models, the ability to identify junk text is greatly improved and the model performs better.
Further, a bagging (bootstrap aggregating) algorithm is combined with the stacking algorithm, and an ensemble learning classification model based on the bagging algorithm serves as the second classification model. Combining the strengths of the multiple sub-classification models included in the ensemble learning classification model further improves the ability to identify junk text while preventing overfitting of the model.
Further, by obtaining multiple division results, the error that a single division result would pass on to the classification models is compensated for, and the influence of the division results of the text to be recognized on the classification is captured at multiple granularities, further improving the ability to identify junk text.
Further, the L1 regularization term of the linear classification model ensures the sparsity of the features and thus a better junk text identification effect, while the dropout mechanism of the deep learning classification model further prevents overfitting of the model.
BRIEF DESCRIPTION OF THE DRAWINGS
To achieve the above and related objectives, certain illustrative aspects are described herein in conjunction with the following description and the accompanying drawings. These aspects indicate various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent by reading the following detailed description in conjunction with the accompanying drawings. Throughout this disclosure, the same reference numerals generally refer to the same parts or elements.
FIG. 1 shows a schematic diagram of a junk text recognition system 100 according to an embodiment of the present invention;
FIG. 2 shows a structural block diagram of a junk text identification device 200 according to an embodiment of the present invention;
FIG. 3 shows a structural block diagram of a computing device 300 according to an embodiment of the present invention;
FIG. 4 shows a flowchart of a junk text identification method 400 according to an embodiment of the present invention; and
FIG. 5 shows a structural block diagram of a junk text identification device 500 according to an embodiment of the present invention.
DETAILED DESCRIPTION
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the accompanying drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
FIG. 1 shows a schematic diagram of a junk text recognition system 100 according to an embodiment of the present invention. As shown in FIG. 1, the junk text recognition system 100 includes a recognition front end 110 and a junk text identification device 200. Junk text here refers to abnormal text, that is, text containing spam. A typical piece of junk text is text containing pornographic information. The following is an example of junk text:
"是苏静吗?波多野结衣携手四位天后级名优重出江湖,为您提供视频观看地址http://tb.cn/GIQZkPw" (roughly: "Is this Su Jing? Yui Hatano and four diva-level stars are making a comeback; watch the video at http://tb.cn/GIQZkPw").
The recognition front end 110 is any requester that needs to determine whether a text to be recognized is junk text. For example, in one approach, the recognition front end 110 may be part of an instant messaging system. The instant messaging system can receive messages entered by users. If a message is junk text, the instant messaging system needs to intercept it; if the message is not junk text, the instant messaging system can deliver it. The recognition front end 110 therefore sends the message to the junk text identification device 200 to determine whether the message is junk text.
The recognition front end 110 may also be part of a comment review system. The comment review system can receive comments entered by users. If a comment is junk text, the comment review system needs to intercept it; if the comment is not junk text, the comment review system can publish it. In this case, the recognition front end 110 may send a junk text identification request including the comment to the junk text identification device 200 for processing.
The present invention is not limited to a specific form of the recognition front end 110. The junk text identification device 200 may also receive requests from the recognition front end 110 in various ways. For example, the junk text identification device 200 may provide an application programming interface (API) with a predefined format, so that the recognition front end 110 can organize a junk text identification request according to that definition and send it to the junk text identification device 200.
The junk text identification device 200 receives the request, obtains the text to be recognized from the request, and determines whether the text to be recognized is junk text.
FIG. 2 shows a structural block diagram of a junk text identification device 200 according to an embodiment of the present invention. As shown in FIG. 2, the junk text identification device 200 includes a text division unit 210, a feature learning unit 220, a first classification unit 230, a feature combination unit 240, and a second classification unit 250.
The text division unit 210 is adapted to perform text division on the text to be recognized to obtain a division result. The feature learning unit 220 is adapted to generate a feature vector for the text to be recognized based on the division result. The first classification unit 230 is adapted to input the feature vector into a plurality of first classification models to obtain the outputs of these first classification models; the first classification models may include a linear classification model and a deep learning classification model. The feature combination unit 240 is adapted to combine at least the outputs of the plurality of first classification models to obtain a combination vector. The second classification unit 250 is adapted to use a second classification model to determine, according to the combination vector, whether the text to be recognized is junk text.
Both the first classification unit 230 and the second classification unit 250 may include a plurality of processing modules. The processing modules included in the first classification unit 230 may implement the intended plurality of first classification models, and the processing modules included in the second classification unit 250 may implement the intended second classification model. The outputs of the first classification models and the second classification model may each indicate whether the text to be recognized is junk text, or the probability that the text to be recognized is junk text.
The first classification models and the second classification model contain a large number of computational parameters, and these parameters need to be adjusted through training in order to achieve the best classification effect in actual use. Therefore, each processing module in the first classification unit 230 and the second classification unit 250 includes a large number of computational parameters awaiting training. As shown in FIG. 1, the junk text recognition system 100 further includes a full training set 130. The samples in the full training set 130 can be used to train the first classification models and the second classification model. The full training set 130 includes a plurality of labeled samples, each label indicating whether the sample is junk text. The samples usually include positive samples and negative samples: the label of a positive sample indicates that the sample is junk text, while the label of a negative sample indicates that the sample is not junk text. The full training set 130 needs to include a predetermined ratio of positive samples to negative samples, so that the first classification models and the second classification model can be trained more comprehensively.
Specifically, the first classification models are trained on a first training set with the above feature vectors as input, and the second classification model is trained on a second training set with the above combination vectors as input; both the first training set and the second training set are sampled from the full training set. The first training set is not equal to the second training set. That is, the first training set and the second training set may intersect but may not be identical. Of course, the first training set and the second training set may also have no intersection at all.
To ensure the sparsity of the features, the linear classification model may be trained with L1 regularization, that is, the loss function of the linear classification model may include an L1 regularization term (the L1 norm). To prevent overfitting of the model, the deep learning classification model may be trained with a dropout mechanism, that is, the deep learning classification model may discard a portion of its neurons according to a predetermined dropout ratio during training.
To further avoid overfitting, the second classification model may include an ensemble learning classification model, namely an ensemble learning classification model based on the bootstrap aggregating (bagging) algorithm.
The ensemble learning classification model includes a predetermined number of sub-classification models. A predetermined number of equally sized sub-training sets can be sampled from the second training set uniformly and with replacement (that is, using bootstrap sampling) to train the predetermined number of sub-classification models. These sub-training sets correspond one-to-one with the sub-classification models; that is, each sub-classification model is trained on the sub-training set corresponding to it.
After each sub-classification model has been trained, the second classification unit 250 may input the above combination vector into each sub-classification model included in the ensemble learning classification model, determine the output of the second classification model from the outputs of the individual sub-classification models by a voting mechanism or by averaging, and then determine, according to the output of the second classification model, whether the text to be recognized is junk text.
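For illustration only, the following Python sketch shows bootstrap sampling and majority voting under the assumption that the sub-classification models are scikit-learn decision trees; the patent does not prescribe a particular implementation, and the function names are assumptions.

```python
# Minimal sketch: bootstrap-sampled sub-training sets drawn from the second
# training set, plus majority voting over the sub-classification models.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_sub_models(X2, y2, n_models=100, seed=0):
    """X2, y2: NumPy arrays of combination vectors and labels (second training set)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X2), size=len(X2))  # uniform, with replacement
        models.append(DecisionTreeClassifier().fit(X2[idx], y2[idx]))
    return models

def predict_by_vote(models, combination_vector):
    votes = [int(m.predict(combination_vector.reshape(1, -1))[0]) for m in models]
    return int(np.bincount(votes).argmax())  # majority vote: 1 = junk text
```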
According to one embodiment of the present invention, the above division result may include multiple division results. For each division result, a plurality of first classification models corresponding to that division result can be trained with feature vectors generated from that division result as input.
Correspondingly, for each division result, the feature learning unit 220 may generate a feature vector for the text to be recognized based on that division result, and the first classification unit 230 may input the feature vector based on that division result into the plurality of first classification models corresponding to that division result to obtain their outputs. The step of combining at least the outputs of the plurality of first classification models to obtain a combination vector may then include: combining at least the outputs of the first classification models corresponding to the respective division results to obtain the combination vector.
According to another embodiment of the present invention, when the text to be recognized is a message including a message signature, the junk text identification device 200 may further include a historical probability calculation unit 260 (not shown in FIG. 2). The historical probability calculation unit 260 is adapted to calculate, based on the message signature, the historical probability that the text to be recognized is junk text. Correspondingly, the historical probability and the outputs of all the first classification models can be combined to obtain the above combination vector.
Here, a message refers to text sent from one party (the message sender) to another party (the message receiver), and it includes a message signature. The message signature uniquely identifies the message sender, and can usually be a company name, brand name, project name, application name, and so on. The message signature is generally located at the beginning of the message and is set off from the rest of the content by delimiters such as "【】". The following is an example of a message: "【XX外卖】您的外卖已送达。" ("[XX Takeaway] Your takeaway has been delivered."), in which "XX外卖" ("XX Takeaway") is the message signature.
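As a small illustrative sketch, signature extraction could look like the following; the delimiter convention follows the example above, and the regular expression and function name are assumptions rather than part of the patent.

```python
import re

# Signature assumed to sit at the start of the message between 【 and 】.
SIGNATURE_RE = re.compile(r'^【([^】]+)】')

def extract_signature(message: str):
    match = SIGNATURE_RE.match(message)
    return match.group(1) if match else None

print(extract_signature("【XX外卖】您的外卖已送达。"))  # -> XX外卖
```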
In the following, the specific structures of the devices and units mentioned above, together with the corresponding processing methods, will be described with reference to the accompanying drawings.
According to the embodiments of the present invention, the various components in the above junk text recognition system 100, such as the various units and devices, may each be implemented by a computing device 300 as described below. FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the present invention.
As shown in FIG. 3, in a basic configuration 302, the computing device 300 typically includes a system memory 306 and one or more processors 304. A memory bus 308 may be used for communication between the processors 304 and the system memory 306.
Depending on the desired configuration, the processor 304 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 304 may include one or more levels of cache, such as a level-1 cache 310 and a level-2 cache 312, a processor core 314, and registers 316. An example processor core 314 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 318 may be used with the processor 304, or in some implementations the memory controller 318 may be an internal part of the processor 304.
Depending on the desired configuration, the system memory 306 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 306 may include an operating system 320, one or more applications 322, and program data 324. In some embodiments, the applications 322 may be arranged to be executed on the operating system by the one or more processors 304 using the program data 324.
Depending on the desired configuration, the computing device 300 may further include a storage device 332 and a storage interface bus 334, and the storage device 332 may include removable storage 336 and non-removable storage 338.
The computing device 300 may also include an interface bus 340 that facilitates communication from various interface devices (for example, output devices 342, peripheral interfaces 344, and communication devices 346) to the basic configuration 302 via a bus/interface controller 330. Example output devices 342 include a graphics processing unit 348 and an audio processing unit 350, which may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 352. Example peripheral interfaces 344 may include a serial interface controller 354 and a parallel interface controller 356, which may be configured to facilitate communication, via one or more I/O ports 358, with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 346 may include a network controller 360, which may be arranged to facilitate communication with one or more other computing devices 362 over a network communication link via one or more communication ports 364.
A network communication link is one example of a communication medium. A communication medium may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. A "modulated data signal" is a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. As non-limiting examples, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used here may include both storage media and communication media.
The computing device 300 may be implemented as a server, such as a database server, an application server, or a web server, or as a personal computer including desktop and notebook configurations. Of course, the computing device 300 may also be implemented as part of a small-sized portable (or mobile) electronic device.
In an embodiment according to the present invention, the computing device 300 is implemented as the junk text identification device 200 and is configured to perform the junk text identification method 400 according to the embodiments of the present invention. The applications 322 of the computing device 300 contain a plurality of program instructions for executing the junk text identification method 400 according to the embodiments of the present invention, and the program data 324 may further store configuration information of the junk text identification device 200 and the like.
FIG. 4 shows a flowchart of a junk text identification method 400 according to an embodiment of the present invention. As shown in FIG. 4, the junk text identification method 400 starts at step S410.
In step S410, text division is performed on the text to be recognized to obtain a division result.
Any text division method in the art can be used to divide the text to be recognized. For example, a word segmentation algorithm can be used to divide the text to be recognized, yielding a division result that includes a plurality of segmented words. The present invention does not limit the specific word segmentation algorithm. The following is an example:
Using a word segmentation algorithm to divide the text "今天天气很好" ("The weather is very good today") yields a division result consisting of the four segmented words "今天" (today), "天气" (weather), "很" (very), and "好" (good).
For another example, text division may be performed on the text to be recognized based on an n-gram language model, yielding a division result that includes a plurality of n-grams. In the embodiments of the present invention, n usually takes the value 2 or 3.
The n-gram language model is a probabilistic language model based on an (n-1)-order Markov chain, which infers the structure of a sentence from the probabilities with which sequences of n items occur. An n-gram is a contiguous sequence of n items from a given text; the items can be phonemes, syllables, letters, words, or characters. In the embodiments of the present invention, an n-gram is a contiguous sequence of n characters from the text to be recognized. The following are two examples:
Dividing "今天天气很好" with a bigram language model yields a division result consisting of the five bigrams "今天", "天天", "天气", "气很", and "很好".
Dividing "今天天气很好" with a trigram language model yields a division result consisting of the four trigrams "今天天", "天天气", "天气很", and "气很好".
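For illustration, a Python sketch of the three kinds of division results is given below; the jieba segmenter is an assumed stand-in, since the patent does not name a specific word segmentation algorithm.

```python
import jieba  # assumed segmenter; any word segmentation algorithm would do

def char_ngrams(text, n):
    """Contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

text = "今天天气很好"
words = list(jieba.cut(text))    # e.g. ['今天', '天气', '很', '好']
bigrams = char_ngrams(text, 2)   # ['今天', '天天', '天气', '气很', '很好']
trigrams = char_ngrams(text, 3)  # ['今天天', '天天气', '天气很', '气很好']
```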
Then, in step S420, a feature vector is generated for the text to be recognized based on the division result. Specifically, the feature vector may include a first vector and a second vector.
At least a bag-of-words model can be used to generate the first vector of the text to be recognized based on the division result. In particular, the bag-of-words model may first be used to generate a bag-of-words vector of the text to be recognized from the division result, and then a feature extraction method such as the term frequency-inverse document frequency (TF-IDF) algorithm or a mutual information algorithm may be applied to the bag-of-words vector to obtain the first vector.
A word embedding model can be used to generate the second vector of the text to be recognized, that is, a word vector, based on the division result. The present invention does not limit the specific word embedding model; for example, the Skip-Gram model or the CBOW (continuous bag of words) model may be used.
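A minimal sketch of the two feature vectors follows, assuming scikit-learn's TfidfVectorizer for the bag-of-words/TF-IDF path and gensim's Word2Vec in skip-gram mode for the embedding path; both library choices, the toy corpus, and averaging the word vectors are assumptions, not details fixed by the patent.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

corpus = [["今天", "天气", "很", "好"],      # toy pre-divided texts
          ["视频", "观看", "地址"]]

# First vector: bag of words weighted by TF-IDF.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)  # input is pre-split
first_vectors = tfidf.fit_transform(corpus)

# Second vector: average of skip-gram word vectors (sg=1 selects Skip-Gram).
w2v = Word2Vec(sentences=corpus, vector_size=100, sg=1, min_count=1)

def second_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```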
Then, in step S430, the above feature vectors are input into a plurality of first classification models to obtain the outputs of the plurality of first classification models.
Since the first classification models may include two types of models, linear classification models and deep learning classification models, the first vector generated from the above division result can be input into the linear classification models, and the second vector generated from the above division result can be input into the deep learning classification models.
After the outputs of the plurality of first classification models are obtained, in step S440, at least the outputs of the plurality of first classification models are combined to obtain a combination vector. Simply put, at least the outputs of the plurality of first classification models can be concatenated into the combination vector.
According to one embodiment of the present invention, the text to be recognized may be a message as introduced above, which includes a message signature. The junk text identification method 400 may therefore further include the step of calculating, based on the message signature, the historical probability that the text to be recognized is junk text. Specifically, historical messages that include the message signature of the text to be recognized can be obtained (for example, from a preset historical message database), and the ratio of the number of those historical messages determined to be junk text to the total number of obtained historical messages is computed as the historical probability that the text to be recognized is junk text.
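As an illustrative sketch of this computation (the record layout of the historical message database and the example signature are assumptions):

```python
def historical_probability(signature, history):
    """history: iterable of (signature, is_junk) pairs for past messages."""
    matches = [is_junk for sig, is_junk in history if sig == signature]
    if not matches:
        return 0.0  # no history for this signature
    return sum(matches) / len(matches)  # share of past messages judged junk

# Hypothetical example: two of three past messages with this signature were junk.
print(historical_probability("XX贷款", [("XX贷款", 1), ("XX贷款", 1), ("XX贷款", 0)]))
```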
Correspondingly, the historical probability that the text to be recognized is junk text can be combined with the outputs of all the first classification models to obtain the above combination vector. In this way, by introducing the historical probability calculated from the message signature, the risk that the text to be recognized is junk text can be assessed from multiple dimensions.
Finally, in step S450, the second classification model can be used to determine, according to the combination vector, whether the text to be recognized is junk text. That is, the combination vector can be input into the second classification model to obtain the output of the second classification model, and whether the text to be recognized is junk text is then determined according to the output of the second classification model.
According to one embodiment of the present invention, the first classification models are trained on a first training set with the above feature vectors as input, and the second classification model is trained on a second training set with the above combination vectors as input. Both the first training set and the second training set are sampled from the full training set. The full training set includes a plurality of labeled samples, each label indicating whether the sample is junk text. The samples usually include positive samples and negative samples: the label of a positive sample indicates that the sample is junk text, while the label of a negative sample indicates that the sample is not junk text. The full training set 130 needs to include a predetermined ratio of positive samples to negative samples, so that the first classification models and the second classification model can be trained more comprehensively.
The first training set is not equal to the second training set. That is, the first training set and the second training set may intersect but may not be identical. Of course, the first training set and the second training set may also have no intersection at all.
On the one hand, the linear classification model may be trained with L1 regularization to ensure the sparsity of the features; that is, the loss function of the linear classification model may include an L1 regularization term (the L1 norm). In one embodiment, the linear classification model may include a logistic regression model and/or a support vector machine (SVM) model, and the logistic regression model may be trained with L1 regularization.
The present invention does not limit the specific linear classification model; besides the logistic regression model and the support vector machine model, other linear classification models may also be used.
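As an illustration, a minimal scikit-learn sketch of L1-regularized linear models follows; the library choice and the regularization strengths are assumptions, and the training data names are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# penalty='l1' adds the L1 norm to the loss so that uninformative features
# receive zero weights (sparsity). C is the inverse regularization strength:
# a smaller C means a larger L1 penalty.
lr = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
svm = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=1.0)
# lr.fit(first_vectors, labels); svm.fit(first_vectors, labels)
```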
On the other hand, the deep learning classification model may be trained with a dropout mechanism to prevent overfitting; that is, the deep learning classification model may discard a portion of its neurons according to a predetermined dropout ratio during training. In one embodiment, the deep learning classification model may include a convolutional neural network (CNN) model and/or a recurrent neural network (RNN) model.
The present invention does not limit the specific deep learning classification model; besides the convolutional neural network model and the recurrent neural network model, other deep learning classification models may also be used.
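A minimal TextCNN sketch with dropout follows, assuming PyTorch; the layer sizes and the 0.5 dropout ratio are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, n_filters=128, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(dropout)  # drops neurons at the given ratio in training
        self.fc = nn.Linear(n_filters, 2)   # two classes: junk / not junk

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)       # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values  # max-over-time pooling
        return self.fc(self.dropout(x))
```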
In addition, to further improve the accuracy of the classification result and prevent overfitting, the second classification model may include an ensemble learning classification model, specifically an ensemble learning classification model based on the bootstrap aggregating algorithm. The ensemble learning classification model includes a predetermined number of sub-classification models; the above combination vector can be input into each sub-classification model included in the ensemble learning classification model, and the output of the second classification model is determined from the outputs of the individual sub-classification models by a voting mechanism or by averaging.
Each sub-classification model included in the ensemble learning model is trained on the sub-training set corresponding to that sub-classification model, and the sub-training sets are obtained by sampling the second training set uniformly and with replacement. Specifically, a predetermined number of equally sized sub-training sets can be sampled from the second training set uniformly and with replacement (that is, using bootstrap sampling) to train the predetermined number of sub-classification models, with the sub-training sets corresponding one-to-one to the sub-classification models.
The present invention does not limit the specific ensemble learning model. The ensemble learning classification model may be, for example, a random forest model or a gradient boosting decision tree (GBDT) model, in which the sub-classification models are decision trees. The predetermined number can usually take the value 100.
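A minimal sketch of this second classification model, assuming scikit-learn: n_estimators=100 matches the predetermined number of 100 sub-models, and bootstrap=True gives the uniform sampling with replacement described above; the variable names are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

second_model = RandomForestClassifier(n_estimators=100, bootstrap=True)
# second_model.fit(combination_vectors, labels)  # trained on the second training set
# is_junk = second_model.predict(combination_vector.reshape(1, -1))[0]
```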
In addition, according to one embodiment of the present invention, the above division result includes multiple division results. Specifically, text division can be performed on the text to be recognized based on a word segmentation algorithm to obtain a division result including a plurality of segmented words, and text division can be performed on the text to be recognized based on an n-gram language model to obtain a division result including a plurality of n-grams.
The division result including a plurality of n-grams may comprise a division result including a plurality of bigrams and a division result including a plurality of trigrams, obtained by dividing the text to be recognized based on a bigram language model and a trigram language model, respectively.
Obviously, for each division result, a plurality of first classification models corresponding to that division result need to be trained with feature vectors generated from that division result as input.
For each division result, a feature vector can be generated for the text to be recognized based on that division result, and the feature vector generated from that division result is input into the plurality of first classification models corresponding to that division result to obtain their outputs. In this case, the step of combining at least the outputs of the plurality of first classification models to obtain the combination vector is:
combining at least the outputs of the first classification models corresponding to the respective division results to obtain the combination vector.
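For illustration, a sketch of assembling the combination vector from the per-division-result first classification models; the predict_proba-style interface and all variable names are assumptions.

```python
import numpy as np

def build_combination_vector(features_by_division, models_by_division):
    """features_by_division: {name: (1, d) feature array for that division result};
    models_by_division: {name: list of trained first classification models}."""
    parts = []
    for name, models in models_by_division.items():
        x = features_by_division[name]
        parts.extend(float(m.predict_proba(x)[0, 1]) for m in models)  # P(junk)
    return np.array(parts)  # input to the second classification model
```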
According to one embodiment of the present invention, different division results correspond to different L1 regularization terms. A linear classification model corresponding to the division result including a plurality of segmented words usually has high precision and low recall, so the L1 regularization term to be added is small. A linear classification model corresponding to the division result including a plurality of bigrams usually has low precision and high recall, so the L1 regularization term to be added is large. A linear classification model corresponding to the division result including a plurality of trigrams has moderate precision and recall, but the feature vectors generated from the trigram division result are of very high dimensionality, so a larger L1 regularization term also needs to be added. In the embodiments of the present invention, the L1 regularization term corresponding to the division result including a plurality of segmented words is the smallest, the one corresponding to the division result including a plurality of trigrams is in the middle, and the one corresponding to the division result including a plurality of bigrams is the largest.
According to another embodiment of the present invention, different division results correspond to different dropout ratios. Similarly to the L1 regularization terms, in the embodiments of the present invention the dropout ratio corresponding to the division result including a plurality of segmented words is the smallest, the one corresponding to the division result including a plurality of trigrams is in the middle, and the one corresponding to the division result including a plurality of bigrams is the largest.
FIG. 5 shows a structural block diagram of a junk text identification device 500 according to an embodiment of the present invention. As shown in FIG. 5, the junk text identification device 500 may include a first text division unit 510, a second text division unit 512, a third text division unit 514, a first feature learning unit 520, a second feature learning unit 522, a first basic classification unit 530, a second basic classification unit 532, a third basic classification unit 534, an output combination unit 540, and an integrated classification unit 550.
The first text division unit 510 is adapted to perform text division on the text to be recognized based on a word segmentation algorithm to obtain a first division result including a plurality of segmented words. The second text division unit 512 is adapted to perform text division on the text to be recognized based on a bigram language model to obtain a second division result including a plurality of bigrams. The third text division unit 514 is adapted to perform text division on the text to be recognized based on a trigram language model to obtain a third division result including a plurality of trigrams.
The first feature learning unit 520 is connected to the first text division unit 510, the second text division unit 512, and the third text division unit 514, and is adapted to use at least a bag-of-words model to generate a first vector for the text to be recognized based on the first division result, a first vector based on the second division result, and a first vector based on the third division result.
The second feature learning unit 522 is connected to the first text division unit 510, the second text division unit 512, and the third text division unit 514, and is adapted to use a word embedding model to generate a second vector for the text to be recognized based on the first division result, a second vector based on the second division result, and a second vector based on the third division result.
The first basic classification unit 530 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the first division result into a first logistic regression model and a first support vector machine model corresponding to the first division result, and to input the second vector generated from the first division result into a first convolutional neural network model corresponding to the first division result, so as to obtain the outputs of the first logistic regression model, the first support vector machine model, and the first convolutional neural network model.
The second basic classification unit 532 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the second division result into a second logistic regression model and a second support vector machine model corresponding to the second division result, and to input the second vector generated from the second division result into a second convolutional neural network model corresponding to the second division result, so as to obtain the outputs of the second logistic regression model, the second support vector machine model, and the second convolutional neural network model.
The third basic classification unit 534 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the third division result into a third logistic regression model and a third support vector machine model corresponding to the third division result, and to input the second vector generated from the third division result into a third convolutional neural network model corresponding to the third division result, so as to obtain the outputs of the third logistic regression model, the third support vector machine model, and the third convolutional neural network model.
The output combination unit 540 is connected to the first basic classification unit 530, the second basic classification unit 532, and the third basic classification unit 534, and is adapted to combine the outputs of the first logistic regression model, the first support vector machine model, the first convolutional neural network model, the second logistic regression model, the second support vector machine model, the second convolutional neural network model, the third logistic regression model, the third support vector machine model, and the third convolutional neural network model into a combination vector.
According to one embodiment of the present invention, when the text to be recognized is a message including a message signature, the output combination unit 540 may combine the historical probability that the text to be recognized is junk text with the outputs of the first logistic regression model, the first support vector machine model, the first convolutional neural network model, the second logistic regression model, the second support vector machine model, the second convolutional neural network model, the third logistic regression model, the third support vector machine model, and the third convolutional neural network model to obtain the combination vector.
The integrated classification unit 550 is connected to the output combination unit 540 and is adapted to input the combination vector into each decision tree included in a random forest model, so that a voting mechanism is used to determine the output of the random forest model from the outputs of the individual decision trees. Finally, whether the text to be recognized is junk text is determined according to the output of the random forest model.
The corresponding processing in each unit of the junk text identification device 500 has already been explained in detail in the above description of the junk text identification method 400 with reference to FIG. 1 to FIG. 4, so the repeated content will not be described again here.
In summary, the junk text identification method according to the embodiments of the present invention is based on a stacking algorithm: a second classification model integrates a plurality of first classification models to produce the classification result. By combining the strengths of multiple types of first classification models, the ability to identify junk text is greatly improved and the model performs better.
Further, a bagging algorithm is combined on the basis of the stacking algorithm, with an ensemble learning classification model based on the bagging algorithm as the second classification model, to further improve the ability to identify junk text while preventing overfitting of the model.
Further, by obtaining multiple division results, the error that a single division result would pass on to the classification models is compensated for, and the influence of the text features of the text to be recognized on the classification is captured at multiple granularities, further improving the ability to identify junk text.
Further, the L1 regularization term of the linear classification model ensures the sparsity of the features and thus a better junk text identification effect, and the dropout mechanism of the deep learning classification model further prevents overfitting of the model.
Taking the identification of messages containing pornographic information as an example, the main difficulties in identifying them are as follows:
The proportion of messages containing pornographic information relative to normal messages is extremely low, usually lower than 1:10,000. The variance is large: the types of messages covered are very broad, and the forms taken by messages containing pornographic information are extremely complex. In addition, such messages have many variants and are highly obscure. Because of these difficulties, the identification ability of traditional junk text identification schemes is limited. For example, using only the aforementioned first division result, first vector, and support vector machine model, the F1 value of the model is 0.954. Using only the aforementioned second division result, first vector, and support vector machine model, the F1 value of the model is 0.961. Using only the aforementioned first division result, second vector, and convolutional neural network model, the F1 value of the model is 0.971. With the junk text identification scheme according to the embodiment of the present invention shown in FIG. 5, the F1 value of the model is 0.987. The F1 value is the F-score, the harmonic mean of the precision and recall of the model; generally, the larger the F1 value, the better the performance of the model.
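For reference, with precision P and recall R, the harmonic mean takes the standard form F1 = 2·P·R / (P + R), so a high F1 requires both quantities to be high at once: with F1 = 0.987, even if P were 1, R would still have to be at least about 0.974.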
Obviously, with the junk text identification scheme according to the embodiments of the present invention, the model performs better and identification is more accurate.
It should be understood that the various techniques described herein may be implemented in conjunction with hardware or software, or a combination thereof. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in a tangible medium, such as a floppy disk, CD-ROM, hard drive, or any other machine-readable storage medium, wherein when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code is executed on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the various methods of the present invention according to the instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that, in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from the devices in the examples. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in a device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as methods, or as combinations of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the described functions. Thus, a processor having the necessary instructions for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention as described herein. Moreover, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the present disclosure is illustrative rather than restrictive, and the scope of the invention is defined by the appended claims.

Claims (18)

  1. A method for identifying junk text, the method comprising the steps of:
    performing text division on a text to be identified to obtain a division result;
    generating a feature vector for the text to be identified based on the division result;
    inputting the feature vector into a plurality of first classification models to obtain outputs of the plurality of first classification models, the first classification models comprising a linear classification model and a deep learning classification model;
    combining at least the outputs of the plurality of first classification models to obtain a combination vector; and
    determining, based on the combination vector and using a second classification model, whether the text to be identified is junk text.
  2. The method of claim 1, wherein the step of generating a feature vector for the text to be identified based on the division result comprises:
    generating, using at least a bag-of-words model, a first vector of the text to be identified based on the division result;
    generating, using a word embedding model, a second vector of the text to be identified based on the division result.
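By way of illustration only, a minimal sketch of these two feature representations, with a hypothetical token list and a random embedding table standing in for a trained word2vec/GloVe table:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

tokens = ["free", "prize", "click", "link"]  # a division result (hypothetical)

# First vector: bag-of-words counts over a fixed (hypothetical) vocabulary.
vectorizer = CountVectorizer(vocabulary=["free", "prize", "click", "link", "hello"])
first_vector = vectorizer.transform([" ".join(tokens)]).toarray()[0]

# Second vector: average of per-token word embeddings; random stand-ins
# here, where a trained embedding table would be used in practice.
rng = np.random.default_rng(0)
embedding_table = {t: rng.normal(size=8) for t in tokens}
second_vector = np.mean([embedding_table[t] for t in tokens], axis=0)

print(first_vector, second_vector.shape)
```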
  3. The method of claim 2, wherein the step of inputting the feature vector into a plurality of first classification models comprises:
    inputting the first vector generated based on the division result into the linear classification model;
    inputting the second vector generated based on the division result into the deep learning classification model.
  4. The method of claim 1, wherein the text to be identified is a message, the message comprises a message signature, and the method further comprises the step of:
    calculating, based on the message signature, a historical probability that the text to be identified is junk text;
    correspondingly, the step of combining at least the outputs of the plurality of first classification models to obtain a combination vector comprises:
    combining the historical probability and the outputs of the plurality of first classification models to obtain the combination vector.
  5. The method of claim 4, wherein the step of calculating, based on the message signature, a historical probability that the text to be identified is junk text comprises:
    obtaining historical messages that include the message signature of the text to be identified;
    calculating the ratio of the number of historical messages determined to be junk text to the total number of the historical messages, as the historical probability.
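Illustratively, the computation of claim 5 might look like the following sketch; the message record layout and the default value for unseen signatures are assumptions:

```python
def signature_history_probability(history: list, signature: str) -> float:
    """Share of past messages carrying `signature` that were junk."""
    relevant = [m for m in history if m["signature"] == signature]
    if not relevant:
        return 0.0  # default for unseen signatures is an assumption
    junk = sum(1 for m in relevant if m["is_junk"])
    return junk / len(relevant)


history = [
    {"signature": "[XX Bank]", "is_junk": False},
    {"signature": "[XX Bank]", "is_junk": True},
]
print(signature_history_probability(history, "[XX Bank]"))  # 0.5
```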
  6. The method of claim 1 or 4, wherein the step of determining, based on the combination vector and using a second classification model, whether the text to be identified is junk text comprises:
    inputting the combination vector into the second classification model to obtain an output of the second classification model;
    determining, according to the output of the second classification model, whether the text to be identified is junk text.
  7. The method of claim 6, wherein the second classification model comprises an ensemble learning classification model, the ensemble learning classification model comprises a predetermined number of sub-classification models, and the step of inputting the combination vector into the second classification model to obtain an output of the second classification model comprises:
    inputting the combination vector into each sub-classification model included in the ensemble learning classification model, so that a voting mechanism is used to determine the output of the second classification model according to the output of each sub-classification model.
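A minimal sketch of such a voting mechanism; simple majority voting with insertion-order tie-breaking is one possible choice, as the claim does not fix these details:

```python
from collections import Counter


def majority_vote(sub_model_outputs: list) -> str:
    """Combine sub-classifier labels by simple majority vote."""
    return Counter(sub_model_outputs).most_common(1)[0][0]


print(majority_vote(["junk", "junk", "normal"]))  # 'junk'
```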
  8. The method of claim 1 or 7, wherein the first classification models are trained using a first training set with feature vectors as input, the second classification model is trained using a second training set with combination vectors as input, the first training set and the second training set are sampled from a full training set, and the full training set comprises a plurality of labeled samples, each label indicating whether the corresponding sample is junk text.
  9. The method of claim 8, wherein each sub-classification model included in the second classification model is trained using a sub-training set corresponding to that sub-classification model, the sub-training set being sampled uniformly with replacement from the second training set.
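Uniform sampling with replacement as recited in claim 9 can be sketched as follows (illustrative only; the sub-training-set size matching the source set is an assumption):

```python
import random


def bootstrap_sample(training_set: list, seed: int = 0) -> list:
    """Uniform sampling with replacement: each sub-training set has the
    same size as the source set; on average about 36.8% of examples are
    left out of any given draw."""
    rng = random.Random(seed)
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]


print(bootstrap_sample(list(range(10))))
```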
  10. The method of claim 1, wherein the linear classification model is trained using L1 regularization, and the deep learning classification model is trained using a dropout mechanism.
  11. The method of claim 1, wherein the linear classification model comprises a logistic regression model and/or a support vector machine model, and the deep learning classification model comprises a convolutional neural network model and/or a recurrent neural network model.
  12. The method of claim 7, wherein the ensemble learning classification model comprises a random forest model or a gradient boosted decision tree model.
  13. The method of any one of claims 1-12, wherein the division result comprises a plurality of division results; for each division result, a feature vector is generated for the text to be identified based on that division result, and the feature vector is input into a plurality of first classification models corresponding to that division result, so as to obtain outputs of the plurality of first classification models corresponding to that division result, the plurality of first classification models corresponding to that division result being trained with feature vectors generated based on that division result as input;
    the step of combining at least the outputs of the plurality of first classification models to obtain a combination vector comprises:
    combining at least the outputs of the first classification models corresponding to each division result to obtain the combination vector.
  14. The method of claim 13, wherein the step of performing text division on the text to be identified to obtain a plurality of division results comprises:
    dividing the text to be identified based on a word segmentation algorithm to obtain a division result comprising a plurality of word segments;
    dividing the text to be identified based on an n-gram language model to obtain a division result comprising a plurality of n-gram sequences.
  15. The method of claim 14, wherein the step of dividing the text to be identified based on an n-gram language model to obtain a division result comprising a plurality of n-gram sequences comprises:
    dividing the text to be identified based on a bigram language model to obtain a division result comprising a plurality of bigram sequences;
    dividing the text to be identified based on a trigram language model to obtain a division result comprising a plurality of trigram sequences.
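For illustration, character bigram and trigram division can be obtained with, for example, scikit-learn's CountVectorizer; this is one convenient realization and not mandated by the claim, and the sample text is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["免费领取大奖"]

# Bigram and trigram character windows; CountVectorizer is used here
# only as a convenient way to enumerate them.
bigram = CountVectorizer(analyzer="char", ngram_range=(2, 2)).fit(text)
trigram = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit(text)
print(sorted(bigram.vocabulary_))   # ['免费', '取大', '大奖', '费领', '领取']
print(sorted(trigram.vocabulary_))  # ['免费领', '取大奖', '费领取', '领取大']
```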
  16. A device for identifying junk text, comprising:
    a text division unit, adapted to perform text division on a text to be identified to obtain a division result;
    a feature learning unit, adapted to generate a feature vector for the text to be identified based on the division result;
    a first classification unit, adapted to input the feature vector into a plurality of first classification models to obtain outputs of the plurality of first classification models, the first classification models comprising a linear classification model and a deep learning classification model;
    a feature combination unit, adapted to combine at least the outputs of the plurality of first classification models to obtain a combination vector; and
    a second classification unit, adapted to determine, based on the combination vector and using a second classification model, whether the text to be identified is junk text.
  17. A computing device, comprising:
    one or more processors;
    a memory; and
    one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any one of the methods of claims 1-15.
  18. A readable storage medium storing a program, the program comprising instructions that, when executed by a computing device, cause the computing device to perform any one of the methods of claims 1-15.
PCT/CN2019/105348 2018-09-17 2019-09-11 Junk text identification method and device, computing device and readable storage medium WO2020057413A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811083369.8A CN110929025B (en) 2018-09-17 2018-09-17 Junk text recognition method and device, computing equipment and readable storage medium
CN201811083369.8 2018-09-17

Publications (1)

Publication Number Publication Date
WO2020057413A1 true WO2020057413A1 (en) 2020-03-26

Family

ID=69855841

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105348 WO2020057413A1 (en) 2018-09-17 2019-09-11 Junk text identification method and device, computing device and readable storage medium

Country Status (2)

Country Link
CN (1) CN110929025B (en)
WO (1) WO2020057413A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625645B (en) * 2020-05-14 2023-05-23 北京字节跳动网络技术有限公司 Training method and device for text generation model and electronic equipment
CN111711618A (en) * 2020-06-02 2020-09-25 支付宝(杭州)信息技术有限公司 Risk address identification method, device, equipment and storage medium
CN113723096A (en) * 2021-07-23 2021-11-30 智慧芽信息科技(苏州)有限公司 Text recognition method and device, computer-readable storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541958A (en) * 2010-12-30 2012-07-04 百度在线网络技术(北京)有限公司 Method, device and computer equipment for identifying short text category information
CN106960040A (en) * 2017-03-27 2017-07-18 北京神州绿盟信息安全科技股份有限公司 A kind of URL classification determines method and device
WO2017190527A1 (en) * 2016-05-06 2017-11-09 华为技术有限公司 Text data classification method and server
CN107734131A (en) * 2016-08-11 2018-02-23 中兴通讯股份有限公司 A kind of short message sorting technique and device
CN107844558A (en) * 2017-10-31 2018-03-27 金蝶软件(中国)有限公司 The determination method and relevant apparatus of a kind of classification information

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887523B (en) * 2010-06-21 2013-04-10 南京邮电大学 Method for detecting image spam email by picture character and local invariant feature
US20150310862A1 (en) * 2014-04-24 2015-10-29 Microsoft Corporation Deep learning for semantic parsing including semantic utterance classification
US10331782B2 (en) * 2014-11-19 2019-06-25 Lexisnexis, A Division Of Reed Elsevier Inc. Systems and methods for automatic identification of potential material facts in documents
CN107515873B (en) * 2016-06-16 2020-10-16 阿里巴巴集团控股有限公司 Junk information identification method and equipment
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
CN107507038B (en) * 2017-09-01 2021-03-19 美林数据技术股份有限公司 Electricity charge sensitive user analysis method based on stacking and bagging algorithms
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN107943941B (en) * 2017-11-23 2021-10-15 珠海金山网络游戏科技有限公司 Junk text recognition method and system capable of being updated iteratively

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535944A (en) * 2020-04-21 2021-10-22 阿里巴巴集团控股有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN113590812B (en) * 2020-04-30 2024-03-05 阿里巴巴集团控股有限公司 Junk text training sample screening method and device and electronic equipment
CN113590812A (en) * 2020-04-30 2021-11-02 阿里巴巴集团控股有限公司 Screening method and device of junk text training samples and electronic equipment
CN112131379A (en) * 2020-08-20 2020-12-25 彭涛 Method, device, electronic equipment and storage medium for identifying problem category
CN112560463B (en) * 2020-12-15 2023-08-04 中国平安人寿保险股份有限公司 Text multi-labeling method, device, equipment and storage medium
CN112560463A (en) * 2020-12-15 2021-03-26 中国平安人寿保险股份有限公司 Text multi-labeling method, device, equipment and storage medium
CN113869431A (en) * 2021-09-30 2021-12-31 平安科技(深圳)有限公司 False information detection method, system, computer device and readable storage medium
CN113869431B (en) * 2021-09-30 2024-05-07 平安科技(深圳)有限公司 False information detection method, system, computer equipment and readable storage medium
CN114817526A (en) * 2022-02-21 2022-07-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal
CN114817526B (en) * 2022-02-21 2024-03-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal
CN116564538A (en) * 2023-07-05 2023-08-08 肇庆市高要区人民医院 Hospital information real-time query method and system based on big data
CN116564538B (en) * 2023-07-05 2023-12-19 肇庆市高要区人民医院 Hospital information real-time query method and system based on big data
CN116975863A (en) * 2023-07-10 2023-10-31 福州大学 Malicious code detection method based on convolutional neural network

Also Published As

Publication number Publication date
CN110929025A (en) 2020-03-27
CN110929025B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
WO2020057413A1 (en) Junk text identification method and device, computing device and readable storage medium
US11734329B2 (en) System and method for text categorization and sentiment analysis
US11675977B2 (en) Intelligent system that dynamically improves its knowledge and code-base for natural language understanding
US10949709B2 (en) Method for determining sentence similarity
US10209782B2 (en) Input-based information display method and input system
US20210201143A1 (en) Computing device and method of classifying category of data
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
US11727211B2 (en) Systems and methods for colearning custom syntactic expression types for suggesting next best correspondence in a communication environment
WO2017118427A1 (en) Webpage training method and device, and search intention identification method and device
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
WO2018205084A1 (en) Providing local service information in automated chatting
US9158839B2 (en) Systems and methods for training and classifying data
CN108664574A (en) Input method, terminal device and the medium of information
WO2014117553A1 (en) Method and system of adding punctuation and establishing language model
WO2021063089A1 (en) Rule matching method, rule matching apparatus, storage medium and electronic device
WO2022141875A1 (en) User intention recognition method and apparatus, device, and computer-readable storage medium
WO2022257452A1 (en) Meme reply method and apparatus, and device and storage medium
US11676410B1 (en) Latent space encoding of text for named entity recognition
EP3928221A1 (en) System and method for text categorization and sentiment analysis
Yang et al. Enhanced twitter sentiment analysis by using feature selection and combination
CN113051380A (en) Information generation method and device, electronic equipment and storage medium
US11922515B1 (en) Methods and apparatuses for AI digital assistants
CN111555960A (en) Method for generating information
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
US11640233B2 (en) Foreign language machine translation of documents in a variety of formats

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19861952

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19861952

Country of ref document: EP

Kind code of ref document: A1