WO2020057413A1 - Junk text identification method and device, computing device and readable storage medium - Google Patents

Junk text identification method and device, computing device and readable storage medium

Info

Publication number
WO2020057413A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
classification model
model
classification
vector
Prior art date
Application number
PCT/CN2019/105348
Other languages
French (fr)
Chinese (zh)
Inventor
高喆
康杨杨
周笑添
孙常龙
刘晓钟
司罗
Original Assignee
阿里巴巴集团控股有限公司 (Alibaba Group Holding Limited)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Limited (阿里巴巴集团控股有限公司)
Publication of WO2020057413A1 publication Critical patent/WO2020057413A1/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • The invention relates to the field of artificial intelligence technology, and in particular to a method, device, computing device, and readable storage medium for identifying junk text.
  • Malicious users publish or send junk text containing spam such as pornographic information and uncivilized language on the Internet, which seriously harms the healthy development of the Internet. It is therefore necessary to identify junk text on the Internet so that it can be filtered or deleted.
  • Embodiments of the present invention provide a method, device, computing device, and readable storage medium for identifying junk text, in an effort to solve, or at least alleviate, at least one of the problems above.
  • A method for identifying junk text includes the steps of: performing text division on the text to be recognized to obtain a division result; generating a feature vector for the text to be recognized based on the division result; inputting the feature vector into a plurality of first classification models to obtain the outputs of the plurality of first classification models, where the first classification models include a linear classification model and a deep learning classification model; combining at least the outputs of the plurality of first classification models to obtain a combined vector; and using a second classification model to determine, based on the combined vector, whether the text to be recognized is junk text.
  • The step of generating a feature vector for the text to be recognized based on the division result includes: using at least a bag-of-words model to generate a first vector of the text to be recognized based on the division result; and using a word embedding model to generate a second vector of the text to be recognized based on the division result.
  • The step of inputting the feature vector into the plurality of first classification models includes: inputting the first vector generated based on the division result into the linear classification model; and inputting the second vector generated based on the division result into the deep learning classification model.
  • The text to be recognized may be a message that includes a message signature, and the method then further includes the step of calculating, based on the message signature, the historical probability that the text to be recognized is junk text. Accordingly, combining at least the outputs of the plurality of first classification models to obtain a combined vector includes: combining the historical probability and the outputs of the plurality of first classification models to obtain the combined vector.
  • The step of calculating, based on the message signature, the historical probability that the text to be recognized is junk text includes: obtaining historical messages carrying the same message signature as the text to be recognized; and calculating the ratio of the number of those historical messages determined to be junk text to the total number of those historical messages, which serves as the historical probability.
  • The step of using the second classification model to determine, based on the combined vector, whether the text to be recognized is junk text includes: inputting the combined vector into the second classification model to obtain the output of the second classification model; and determining whether the text to be recognized is junk text according to that output.
  • The second classification model includes an ensemble learning classification model that contains a predetermined number of sub-classification models. Inputting the combined vector into the second classification model to obtain its output includes: inputting the combined vector into each sub-classification model of the ensemble learning classification model separately, and determining the output of the second classification model from the outputs of the sub-classification models by a voting mechanism.
  • The first classification models are trained using a first training set with feature vectors as input, and the second classification model is trained using a second training set with combined vectors as input. The first training set and the second training set are both sampled from a full training set, which includes multiple labeled samples whose labels indicate whether each sample is junk text.
  • Each sub-classification model of the ensemble learning model is trained using a sub-training set corresponding to it, and each sub-training set is sampled uniformly, with replacement, from the second training set.
  • The linear classification model is trained with L1 regularization, and the deep learning classification model is trained with a dropout mechanism.
  • The linear classification model includes a logistic regression model and/or a support vector machine model, and the deep learning classification model includes a convolutional neural network model and/or a recurrent neural network model.
  • The ensemble learning classification model includes a random forest model or a gradient boosting decision tree model.
  • The division result may include multiple division results. For each division result, a feature vector is generated for the text to be recognized based on that division result, and the feature vector is input into the plurality of first classification models corresponding to that division result to obtain their outputs; the first classification models corresponding to a division result are trained with feature vectors generated from that division result as input. The step of combining at least the outputs of the plurality of first classification models to obtain a combined vector then includes: combining at least the outputs of the first classification models corresponding to each division result to obtain the combined vector.
  • Performing text division on the text to be recognized to obtain multiple division results includes: dividing the text to be recognized with a word segmentation algorithm to obtain a division result containing multiple word segments; and dividing the text to be recognized with an n-gram language model to obtain division results containing multiple n-grams.
  • Dividing the text to be recognized with an n-gram language model to obtain division results containing multiple n-grams includes: dividing the text to be recognized with a bigram language model to obtain a division result containing multiple bigrams; and dividing the text to be recognized with a trigram language model to obtain a division result containing multiple trigrams.
  • A device for identifying junk text includes: a text division unit adapted to perform text division on the text to be recognized to obtain a division result; a feature learning unit adapted to generate a feature vector for the text to be recognized based on the division result; a first classification unit adapted to input the feature vector into a plurality of first classification models to obtain their outputs, where the first classification models include a linear classification model and a deep learning classification model; a feature combination unit adapted to combine at least the outputs of the plurality of first classification models to obtain a combined vector; and a second classification unit adapted to use a second classification model to determine, based on the combined vector, whether the text to be recognized is junk text.
  • A computing device includes: one or more processors; a memory; and one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs including instructions for performing any of the methods for identifying junk text according to embodiments of the present invention.
  • A readable storage medium stores a program including instructions that, when executed by a computing device, cause the computing device to perform any of the methods for identifying junk text according to embodiments of the present invention.
  • The method for identifying junk text according to embodiments of the present invention is based on a stacking algorithm: a plurality of first classification models are integrated through a second classification model to obtain the classification result. Combining the strengths of multiple types of first classification models greatly improves the ability to recognize junk text and yields better model performance.
  • An ensemble learning classification model based on the bagging algorithm is used as the second classification model, drawing on the strengths of the multiple sub-classification models it contains to further improve the recognition of junk text while preventing the model from overfitting.
  • Using multiple division results compensates for the error that a single division result would pass to the classification models and describes the classification-relevant characteristics of the text to be recognized at multiple granularities, further improving the ability to recognize junk text.
  • The L1 regularization term of the linear classification model ensures feature sparsity, thereby ensuring better junk text recognition, and the dropout mechanism of the deep learning classification model further prevents the model from overfitting.
  • FIG. 1 shows a schematic diagram of a junk text recognition system 100 according to an embodiment of the present invention
  • FIG. 2 shows a structural block diagram of a device 200 for identifying junk text according to an embodiment of the present invention;
  • FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the present invention;
  • FIG. 4 shows a flowchart of a method 400 for identifying junk text according to an embodiment of the present invention.
  • FIG. 5 shows a structural block diagram of a junk text recognition device 500 according to an embodiment of the present invention.
  • FIG. 1 shows a schematic diagram of a junk text recognition system 100 according to an embodiment of the present invention.
  • The junk text recognition system 100 includes a recognition front end 110 and a junk text recognition device 200.
  • Spam text here refers to abnormal text, or text that includes spam.
  • A typical piece of junk text is text that includes pornographic information.
  • The recognition front end 110 is any requester that needs to determine whether a given text is junk text.
  • The recognition front end 110 may be part of an instant messaging system.
  • The instant messaging system receives messages entered by users. If a message is junk text, the instant messaging system needs to intercept it; if it is not, the instant messaging system can deliver it. The recognition front end 110 therefore sends the message to the junk text recognition device 200 to determine whether the message is junk text.
  • The recognition front end 110 may also be part of a comment review system.
  • The comment review system receives comments entered by users. If a comment is junk text, the comment review system needs to block it; if it is not, the comment review system can publish it. In this case, the recognition front end 110 may send a junk text recognition request containing the comment to the junk text recognition device 200 for processing.
  • the junk text recognition device 200 receives the request, obtains the text to be recognized from the request, and determines whether the text to be recognized is junk text.
  • FIG. 2 shows a structural block diagram of a device 200 for identifying junk text according to an embodiment of the present invention.
  • the junk text recognition device 200 includes a text division unit 210, a feature learning unit 220, a first classification unit 230, a feature combination unit 240, and a second classification unit 250.
  • the junk text recognition system 100 further includes a full training set 130.
  • the first classification model and the second classification model may be trained using the samples in the full training set 130.
  • the full training set 130 includes a plurality of labeled samples indicating whether the samples are junk text. Samples usually include positive and negative samples. A label of a positive sample indicates that the sample is junk text, and a label of a negative sample indicates that the sample is not junk text. It is necessary to include a predetermined proportion of positive samples and negative samples in the full training set 130, so that the first classification model and the second classification model can be more comprehensively trained.
  • the first classification model is obtained by training using the first training set and using the above feature vector as an input.
  • the second classification model is obtained by using the second training set and the above combination vector as input training, and the first training set and the second training set are both sampled from the full training set.
  • Preferably, the first training set is not identical to the second training set; that is, the two sets may intersect but should not coincide. Of course, the first training set and the second training set may also be disjoint.
  • To ensure feature sparsity, the linear classification model can be trained with L1 regularization; that is, the loss function of the linear classification model can include an L1 regularization term (i.e., the L1 norm).
  • To prevent overfitting, the deep learning classification model can be trained with a dropout mechanism; that is, during training, the deep learning classification model can discard some neurons according to a predetermined dropout ratio.
  • The second classification model may include an ensemble learning classification model, specifically one based on bootstrap aggregating (the bagging algorithm).
  • the ensemble learning classification model includes a predetermined number of sub-classification models.
  • A predetermined number of equally sized sub-training sets can be sampled uniformly, with replacement, from the second training set (that is, using bootstrap sampling) to train the predetermined number of sub-classification models.
  • the predetermined number of sub-training sets correspond one-to-one with the predetermined number of sub-classification models. That is, the sub-classification model is trained using a sub-training set corresponding to the sub-classification model.
  • The second classification unit 250 may input the combined vector into each sub-classification model of the ensemble learning classification model separately, and determine the output of the second classification model from the outputs of the sub-classification models by a voting mechanism or by averaging. It then determines whether the text to be recognized is junk text according to the output of the second classification model.
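  • By way of a hedged illustration (the function names and toy values below are assumptions for illustration, not part of the disclosure), a minimal Python sketch of these two aggregation options might look like this:

```python
def majority_vote(sub_outputs):
    # sub_outputs: binary decisions from each sub-classification model (1 = junk text)
    return int(sum(sub_outputs) > len(sub_outputs) / 2)

def average_score(sub_scores, threshold=0.5):
    # Alternative: average the sub-models' probability-like scores, then threshold
    return int(sum(sub_scores) / len(sub_scores) > threshold)

majority_vote([1, 0, 1])        # -> 1 (junk text)
average_score([0.9, 0.2, 0.7])  # -> 1 (mean 0.6 exceeds 0.5)
```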
  • the division result may include multiple division results.
  • The plurality of first classification models corresponding to a division result may be trained with feature vectors generated from that division result as input. For each division result, the feature learning unit 220 may generate a feature vector for the text to be recognized based on that division result, and the first classification unit 230 may input that feature vector into the plurality of first classification models corresponding to that division result to obtain their outputs. Combining at least the outputs of the plurality of first classification models to obtain a combined vector may therefore include: combining at least the outputs of the first classification models corresponding to each division result to obtain the combined vector.
  • The device 200 for identifying junk text may further include a historical probability calculation unit 260 (not shown in FIG. 2).
  • the historical probability calculation unit 260 is adapted to calculate a historical probability that the text to be identified is junk text based on the message signature. Accordingly, the historical probability and the outputs of all the first classification models can be combined to obtain the above-mentioned combination vector.
  • The message refers to text sent from one party (i.e., the message sender) to another party (i.e., the message receiver), and includes a message signature.
  • the message signature is used to uniquely identify the sender of the message, which can usually be the company name, brand name, project name, or application name.
  • The message signature is usually located at the beginning of the message and is separated from the other content by delimiters such as "[]". For example, in the message "[XX Takeaway] Your takeaway has been delivered.", "XX Takeaway" is the message signature.
  • FIG. 3 shows a schematic diagram of a computing device 300 according to one embodiment of the invention.
  • the computing device 300 typically includes a system memory 306 and one or more processors 304.
  • the memory bus 308 may be used for communication between the processor 304 and the system memory 306.
  • The processor 304 may be any type of processor, including but not limited to: a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof.
  • the processor 304 may include one or more levels of cache, such as a primary cache 310 and a secondary cache 312, a processor core 314, and a register 316.
  • the example processor core 314 may include an arithmetic logic unit (ALU), a floating point unit (FPU), a digital signal processing core (DSP core), or any combination thereof.
  • An example memory controller 318 may be used with the processor 304, or in some implementations, the memory controller 318 may be an internal part of the processor 304.
  • The computing device 300 may further include storage devices 332, which may be accessed via a storage interface bus 334 and include removable storage 336 and non-removable storage 338.
  • The computing device 300 may also include an interface bus 340 that facilitates communication from various interface devices (e.g., output devices 342, peripheral interfaces 344, and communication devices 346) to the basic configuration 302 via the bus/interface controller 330.
  • The example output device 342 includes a graphics processing unit 348 and an audio processing unit 350, which can be configured to communicate with various external devices, such as a display or speakers, via one or more A/V ports 352.
  • The example peripheral interface 344 may include a serial interface controller 354 and a parallel interface controller 356, which may be configured to communicate, via one or more I/O ports 358, with input devices (e.g., keyboard, mouse, pen, voice input device, touch input device) or other peripherals (e.g., printer, scanner).
  • An example communication device 346 may include a network controller 360, which may be arranged to facilitate communication with one or more other computing devices 362 over a network communication link via one or more communication ports 364.
  • a network communication link may be one example of a communication medium.
  • Communication media may typically be embodied as computer-readable instructions, data structures, program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery media.
  • a "modulated data signal" can be a signal in which one or more of its data sets or its changes can be made in a manner that encodes information in the signal.
  • Communication media may include wired media, such as a wired network or direct-wired connection, and various wireless media, such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media.
  • the term computer-readable media as used herein may include both storage media and communication media.
  • The computing device 300 may be implemented as a server, such as a database server, an application server, or a web server, or as a personal computer such as a desktop or notebook computer. Of course, the computing device 300 may also be implemented as part of a small-form-factor portable (or mobile) electronic device.
  • The computing device 300 is implemented as the device 200 for identifying junk text and is configured to perform the method 400 for identifying junk text according to an embodiment of the present invention.
  • the application 322 of the computing device 300 includes multiple program instructions for executing the method 400 for identifying junk text according to an embodiment of the present invention, and the program data 324 may further store configuration information of the device 200 for identifying junk text.
  • FIG. 4 shows a flowchart of a method 400 for identifying junk text according to an embodiment of the present invention. As shown in FIG. 4, the method 400 for identifying junk text starts at step S410.
  • In step S410, text division is performed on the text to be recognized to obtain a division result.
  • Any text division method in the art can be used to perform text division on the text to be recognized.
  • a text segmentation algorithm may be used to perform text segmentation on the text to be recognized, and a segmentation result including multiple segmentations may be obtained.
  • the invention does not limit the specific word segmentation algorithm.
  • For example, applying a word segmentation algorithm to the text "今天天气很好" ("the weather is very good today") yields a division result containing the segments "今天" (today), "天气" (weather), "很" (very), and "好" (good).
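  • As a sketch only (the patent does not prescribe a particular segmenter), an open-source Chinese segmenter such as jieba could produce this division result:

```python
import jieba  # one possible Chinese word segmentation library; an assumption, not mandated here

segments = jieba.lcut("今天天气很好")
# With the default dictionary this typically yields ['今天', '天气', '很', '好']
```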
  • The n-gram language model is a probabilistic language model based on an (n-1)-order Markov chain, which infers the structure of a sentence from the probabilities of occurrence of sequences of n items.
  • An n-gram is a contiguous sequence of n items from a given text; items can be phonemes, syllables, letters, characters, or words. In the embodiments of the present invention, an n-gram is a contiguous sequence of n words from the text to be recognized.
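  • A minimal sketch of this division, assuming character-level items as is common for Chinese text (the helper name is invented for illustration):

```python
def ngrams(text, n):
    # Slide a window of width n over the text to collect contiguous n-grams
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = ngrams("今天天气很好", 2)   # ['今天', '天天', '天气', '气很', '很好']
trigrams = ngrams("今天天气很好", 3)  # ['今天天', '天天气', '天气很', '气很好']
```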
  • In step S420, a feature vector is generated for the text to be recognized based on the division result.
  • the feature vector may include a first vector and a second vector.
  • At least a bag-of-words model can be used to generate the first vector of the text to be recognized based on the division result. Specifically, a bag-of-words model may first be used to generate a bag-of-words vector of the text to be recognized based on the division result.
  • Then, a feature extraction method such as the term frequency-inverse document frequency (TF-IDF) algorithm or a mutual information algorithm is used to process the bag-of-words vector and obtain the first vector.
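  • A minimal sketch of this step, assuming scikit-learn (the patent does not name a library; the documents below are invented examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Each document is its division result joined by spaces (word segments or n-grams)
docs = ["今天 天气 很 好", "点击 链接 领取 大奖"]
vectorizer = TfidfVectorizer(tokenizer=str.split, token_pattern=None)
first_vectors = vectorizer.fit_transform(docs)  # sparse TF-IDF "first vectors"
```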
  • In addition, a word embedding model may be used to generate the second vector of the text to be recognized, that is, a word-vector representation, based on the division result.
  • the present invention does not limit the specific word embedding model.
  • For example, a Skip-Gram model or a CBOW (continuous bag-of-words) model can be used.
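  • A minimal sketch of the second-vector generation, assuming gensim 4.x (the library choice and the averaging strategy are assumptions for illustration):

```python
import numpy as np
from gensim.models import Word2Vec  # assumes gensim 4.x

divided = [["今天", "天气", "很", "好"], ["点击", "链接", "领取", "大奖"]]
w2v = Word2Vec(divided, vector_size=100, min_count=1, sg=1)  # sg=1: Skip-Gram; sg=0: CBOW

# One simple way to obtain a text-level second vector: average the word vectors
second_vector = np.mean([w2v.wv[w] for w in divided[0]], axis=0)  # shape (100,)
```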
  • In step S430, the feature vector is input into a plurality of first classification models to obtain the outputs of the plurality of first classification models.
  • The first classification models may include two types of model: linear classification models and deep learning classification models. The first vector generated based on the division result may be input into the linear classification model, and the second vector generated based on the division result may be input into the deep learning classification model.
  • After the outputs of the plurality of first classification models are obtained, in step S440 at least the outputs of the plurality of first classification models are combined to obtain a combined vector.
  • Specifically, at least the outputs of the multiple first classification models can be concatenated to obtain the combined vector.
  • As noted above, the text to be recognized may be a message, and the message includes a message signature. The method 400 for identifying junk text may therefore further include the step of calculating, based on the message signature, the historical probability that the text to be recognized is junk text. Specifically, historical messages carrying the same message signature as the text to be recognized may be obtained (for example, from a preset historical message database), and the ratio of the number of those historical messages determined to be junk text to the total number of retrieved historical messages is calculated as the historical probability that the text to be recognized is junk text.
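  • A minimal sketch of this calculation (the data layout is an assumption for illustration):

```python
def historical_probability(signature, history):
    # history: iterable of (signature, is_junk) pairs from a historical message store
    matched = [is_junk for sig, is_junk in history if sig == signature]
    if not matched:
        return 0.0  # assumption: treat a signature with no history as zero risk
    return sum(matched) / len(matched)

history = [("XX Takeaway", False), ("XX Takeaway", False), ("XX Loans", True)]
historical_probability("XX Takeaway", history)  # -> 0.0
```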
  • In this case, the historical probability that the text to be recognized is junk text can be combined with the outputs of all the first classification models to obtain the combined vector.
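  • Concretely, the combination can be a simple concatenation, as in this sketch (the scores are assumed toy values):

```python
import numpy as np

# Assumed probability-like outputs of the first classification models for one text
lr_out, svm_out, cnn_out = 0.91, 0.87, 0.95
hist_prob = 0.12  # historical probability derived from the message signature

combined_vector = np.array([lr_out, svm_out, cnn_out, hist_prob])  # input to the second model
```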
  • the risk of the text to be identified as junk text can be evaluated from various dimensions.
  • After the combined vector is obtained, a second classification model may be used to determine whether the text to be recognized is junk text: the combined vector is input into the second classification model to obtain its output, and whether the text to be recognized is junk text is then determined according to that output.
  • As mentioned above, the first classification models are trained using the first training set with feature vectors as input, and the second classification model is trained using the second training set with combined vectors as input; the first training set and the second training set are both sampled from the full training set.
  • the full training set includes multiple labeled samples that indicate whether the samples are junk text. Samples usually include positive and negative samples. A label of a positive sample indicates that the sample is junk text, and a label of a negative sample indicates that the sample is not junk text. It is necessary to include a predetermined proportion of positive samples and negative samples in the full training set 130, so that the first classification model and the second classification model can be more comprehensively trained.
  • Preferably, the first training set is not identical to the second training set; that is, the two sets may intersect but should not coincide. Of course, the first training set and the second training set may also be disjoint.
  • The linear classification model can be trained using L1 regularization to ensure feature sparsity; that is, the loss function of the linear classification model may include an L1 regularization term (i.e., the L1 norm).
  • The linear classification model may include a logistic regression model and/or a support vector machine (SVM) model, and the logistic regression model may be trained using L1 regularization.
  • the present invention does not limit the specific linear classification model.
  • Besides the logistic regression model and the support vector machine model, other linear classification models may also be used.
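  • A sketch of such L1-regularized linear models, assuming scikit-learn (the parameter values are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# The L1 penalty drives most feature weights to exactly zero, keeping the features sparse
lr = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
svm = LinearSVC(penalty="l1", dual=False)  # an L1-regularized linear SVM

# lr.fit(first_vectors, labels)  # a smaller C means a stronger L1 term
```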
  • The deep learning classification models can be trained using the dropout mechanism to prevent overfitting; that is, the deep learning classification model can discard some neurons according to a predetermined dropout ratio during training.
  • The deep learning classification model may include a convolutional neural network (CNN) model and/or a recurrent neural network (RNN) model.
  • the invention does not limit the specific deep learning classification model.
  • Besides these, other deep learning classification models may also be used.
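  • A minimal text-CNN sketch with dropout, assuming PyTorch (the architecture details are illustrative, not taken from the patent):

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, num_filters=64, dropout_ratio=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size=3)
        self.dropout = nn.Dropout(dropout_ratio)  # randomly zeroes units during training only
        self.fc = nn.Linear(num_filters, 1)

    def forward(self, token_ids):                       # token_ids: (batch, seq_len), seq_len >= 3
        x = self.embed(token_ids).transpose(1, 2)       # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values  # global max pooling over positions
        return torch.sigmoid(self.fc(self.dropout(x)))  # probability that the text is junk
```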
  • As mentioned above, each sub-classification model of the ensemble learning model is trained using a sub-training set corresponding to it, and each sub-training set is sampled uniformly, with replacement, from the second training set. Specifically, a predetermined number of equally sized sub-training sets can be drawn from the second training set uniformly and with replacement (that is, using bootstrap sampling) to train the predetermined number of sub-classification models; the sub-training sets correspond one-to-one with the sub-classification models.
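  • Bootstrap sampling itself is a one-liner, as in this sketch (assuming NumPy arrays for the second training set; the names are invented):

```python
import numpy as np

def bootstrap_sample(X, y, rng):
    # Uniform sampling with replacement; each sub-training set matches the original size
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

rng = np.random.default_rng(0)
# sub_X, sub_y = bootstrap_sample(X2, y2, rng)  # one sub-training set per sub-classification model
```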
  • the invention does not limit the specific ensemble learning model.
  • the ensemble learning classification model may be, for example, a random forest model or a gradient boosted decision tree model (GBDT model), where the sub-classification model is a decision tree.
  • the predetermined number can usually take the value 100.
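  • A sketch of such a second classification model, assuming scikit-learn (the fit/predict calls are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 bagged decision trees; prediction aggregates the trees' votes
second_model = RandomForestClassifier(n_estimators=100)
# second_model.fit(combined_vectors, labels)
# second_model.predict(combined_vector.reshape(1, -1))
```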
  • As mentioned above, the division results containing multiple n-grams may include a division result containing multiple bigrams and a division result containing multiple trigrams, obtained by dividing the text to be recognized with a bigram language model and a trigram language model, respectively.
  • For each division result, a feature vector can be generated for the text to be recognized based on that division result, and the feature vector generated from that division result is input into the plurality of first classification models corresponding to it to obtain their outputs. At least the outputs of the first classification models corresponding to each division result are then combined to obtain the combined vector.
  • different division results correspond to different L1 regularization terms.
  • The linear classification model corresponding to the division result containing multiple word segments usually has high precision but low recall, so the L1 regularization term to be added is small.
  • The linear classification model corresponding to the division result containing multiple bigrams usually has low precision but high recall, so the L1 regularization term to be added is large.
  • The linear classification model corresponding to the division result containing multiple trigrams has moderate precision and recall; however, the feature vectors generated from this division result are of very high dimensionality, so a larger L1 regularization term needs to be added.
  • Overall, the L1 regularization term corresponding to the division result containing multiple word segments is the smallest, the L1 regularization term corresponding to the division result containing multiple trigrams is in the middle, and the L1 regularization term corresponding to the division result containing multiple bigrams is the largest.
  • Different division results also correspond to different dropout ratios. Similar to the L1 regularization terms, in the embodiment of the present invention the dropout ratio corresponding to the division result containing multiple word segments is the smallest, the dropout ratio corresponding to the division result containing multiple trigrams is in the middle, and the dropout ratio corresponding to the division result containing multiple bigrams is the largest.
  • FIG. 5 shows a structural block diagram of a junk text recognition device 500 according to an embodiment of the present invention.
  • The junk text recognition device 500 may include a first text division unit 510, a second text division unit 512, a third text division unit 514, a first feature learning unit 520, a second feature learning unit 522, a first basic classification unit 530, a second basic classification unit 532, a third basic classification unit 534, an output combining unit 540, and an integrated classification unit 550.
  • The first text division unit 510 is adapted to perform text division on the text to be recognized based on a word segmentation algorithm to obtain a first division result containing multiple word segments.
  • The second text division unit 512 is adapted to perform text division on the text to be recognized based on a bigram language model to obtain a second division result containing multiple bigrams.
  • The third text division unit 514 is adapted to perform text division on the text to be recognized based on a trigram language model to obtain a third division result containing multiple trigrams.
  • The first feature learning unit 520 is connected to the first text division unit 510, the second text division unit 512, and the third text division unit 514, and is adapted to use at least a bag-of-words model to generate first vectors for the text to be recognized based on the first, second, and third division results, respectively.
  • The second feature learning unit 522 is connected to the first text division unit 510, the second text division unit 512, and the third text division unit 514, and is adapted to use a word embedding model to generate second vectors for the text to be recognized based on the first, second, and third division results, respectively.
  • The first basic classification unit 530 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the first division result into the first logistic regression model and the first support vector machine model corresponding to the first division result, and to input the second vector generated from the first division result into the first convolutional neural network model corresponding to the first division result, so as to obtain the outputs of the first logistic regression model, the first support vector machine model, and the first convolutional neural network model.
  • The second basic classification unit 532 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the second division result into the second logistic regression model and the second support vector machine model corresponding to the second division result, and to input the second vector generated from the second division result into the second convolutional neural network model corresponding to the second division result, so as to obtain the outputs of the second logistic regression model, the second support vector machine model, and the second convolutional neural network model.
  • The third basic classification unit 534 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the third division result into the third logistic regression model and the third support vector machine model corresponding to the third division result, and to input the second vector generated from the third division result into the third convolutional neural network model corresponding to the third division result, so as to obtain the outputs of the third logistic regression model, the third support vector machine model, and the third convolutional neural network model.
  • The output combining unit 540 is connected to the first basic classification unit 530, the second basic classification unit 532, and the third basic classification unit 534, and is adapted to combine the outputs of the first logistic regression model, the first support vector machine model, the first convolutional neural network model, the second logistic regression model, the second support vector machine model, the second convolutional neural network model, the third logistic regression model, the third support vector machine model, and the third convolutional neural network model to obtain the combined vector.
  • The output combining unit 540 may also combine the historical probability that the text to be recognized is junk text with the outputs of the first logistic regression model, the first support vector machine model, the first convolutional neural network model, the second logistic regression model, the second support vector machine model, the second convolutional neural network model, the third logistic regression model, the third support vector machine model, and the third convolutional neural network model to obtain the combined vector.
  • The integrated classification unit 550 is connected to the output combining unit 540 and is adapted to input the combined vector into each decision tree of the random forest model, determine the output of the random forest model from the outputs of the decision trees by a voting mechanism, and finally judge whether the text to be recognized is junk text according to the output of the random forest model.
  • The method for identifying junk text according to the embodiments of the present invention is based on a stacking algorithm: a plurality of first classification models are integrated through a second classification model to obtain the classification result. Combining the strengths of multiple types of first classification models greatly improves the ability to recognize junk text and yields better model performance.
  • Further, the bagging algorithm is combined with the stacking algorithm: an ensemble learning classification model based on bootstrap aggregating (bagging) is used as the second classification model, further improving the recognition of junk text while preventing the model from overfitting.
  • The method also obtains multiple division results to compensate for the error that a single division result would pass to the classification models, and describes the classification-relevant characteristics of the text to be recognized at multiple granularities, further improving the ability to recognize junk text.
  • In addition, the L1 regularization term of the linear classification model ensures feature sparsity, thereby ensuring better junk text recognition, and the dropout mechanism of the deep learning classification model further prevents the model from overfitting.
  • The proportion of messages containing pornographic information relative to normal messages is extremely low, usually no more than about one in ten thousand.
  • Moreover, the variance is large: the types of messages covered are very wide, and the forms of expression of pornographic messages are extremely varied.
  • There are also many variants of pornographic messages that are deliberately vague.
  • Traditional junk text recognition schemes therefore have limited recognition ability. For example, when only the aforementioned first division result, the first vector, and the support vector machine model are used, the F1 value of the model is 0.954; when only the aforementioned second division result, the first vector, and the support vector machine model are used, the F1 value of the model is 0.961. In further comparisons the F1 value rises to 0.971, and to 0.987 for the best-performing configuration.
  • The F1 value is the F-score, the harmonic mean of the model's precision and recall; generally, the larger the F1 value, the better the model's performance.
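  • For reference, a small sketch of the F1 computation:

```python
def f1_score(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

f1_score(0.98, 0.96)  # -> 0.9699..., i.e. about 0.97
```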
  • Thus, the model performs better and recognizes junk text more accurately.
  • the various techniques described herein may be implemented in conjunction with hardware or software, or a combination thereof.
  • The method and apparatus of the present invention may take the form of program code (i.e., instructions) embodied in a tangible medium, such as a floppy disk, a CD-ROM, a hard drive, or any other machine-readable storage medium, wherein, when the program is loaded into and executed by a machine such as a computer, the machine becomes a device for practicing the present invention.
  • The computing device generally includes a processor, a processor-readable storage medium (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
  • the memory is configured to store program code; the processor is configured to execute various methods of the present invention according to instructions in the program code stored in the memory.
  • Computer-readable media includes computer storage media and communication media.
  • the computer storage medium stores information such as computer-readable instructions, data structures, program modules, or other data.
  • Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transmission mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
  • The modules, units, or components of the devices in the examples disclosed herein may be arranged in the device as described in the embodiment, or alternatively may be located in one or more devices different from the device in the example.
  • The modules in the foregoing examples may be combined into one module or further divided into multiple sub-modules, and the modules in the devices of the embodiments may be adaptively changed and placed in one or more devices different from those of the embodiments. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed by the present invention is a junk text identification method and device, a computing device and a readable storage medium. One method embodiment comprises the steps of: dividing a text to be recognized so as to obtain a division result; generating a feature vector for the text to be recognized on the basis of the division result; inputting the feature vector into a plurality of first classification models so as to obtain outputs of the plurality of first classification models, the first classification model comprising a linear classification model and a deep learning classification model; at least combining the outputs of the plurality of first classification models, so as to obtain a combined vector; and using a second classification model to determine whether the text to be recognized is junk text according to the combined vector. Also disclosed by the present invention are a corresponding junk text identification device, a computing device and a readable storage medium.

Description

Method, device, computing device and readable storage medium for identifying junk text
This application claims priority to the Chinese patent application No. 201811083369.8, filed on September 17, 2018 and entitled "Method, Device, Computing Device, and Readable Storage Medium for Identifying Junk Text", the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the field of artificial intelligence technology, and in particular to a method, device, computing device, and readable storage medium for identifying junk text.
Background
With the development and popularization of Internet technology, more and more documents and conversations are stored and used on networks in electronic form, and natural language processing technology for handling document and conversation content is increasingly widespread. Within the field of natural language processing, the problem of junk text recognition is receiving growing attention.
Specifically, malicious users publish or send junk text containing spam such as pornographic information and uncivilized language on the Internet, which seriously harms the healthy development of the Internet. It is therefore necessary to identify junk text on the Internet so that it can be filtered or deleted.
In current junk text recognition schemes, approaches that combine traditional feature extraction (e.g., the bag-of-words model) with traditional machine learning classifiers (e.g., support vector machines) have poor recognition performance and weak semantic expressiveness, while approaches that combine word embedding algorithms with deep learning models (e.g., neural networks) require large amounts of training data and use models so complex that they overfit very easily.
Therefore, a more advanced junk text recognition scheme is urgently needed.
Summary of the invention
为此,本发明实施例提供一种垃圾文本的识别方法、装置、计算设备及可读存储介质,以力图解决或者至少缓解上面存在的至少一个问题。To this end, embodiments of the present invention provide a method, a device, a computing device, and a readable storage medium for identifying junk text, in an effort to solve or at least alleviate at least one of the problems above.
根据本发明实施例的一个方面,提供了一种垃圾文本的识别方法,该方法包括步骤:对待识别文本进行文本划分,得到划分结果;基于该划分结果为待识别文本生成特征向量;将该特征向量输入多个第一分类模型,以得到多个第一分类模型的输出,第一分类模型包括线性分类模型和深度学习分类模型;至少对多个第一分类模型的输出进行组合,得到组合向量;以及根据组合向量,采用第二分类模型来判断待识别文本是否为垃圾文 本。According to an aspect of the embodiment of the present invention, a method for identifying junk text is provided. The method includes the steps of: dividing a text to be recognized to obtain a division result; generating a feature vector for the text to be recognized based on the division result; A vector is input to a plurality of first classification models to obtain outputs of the plurality of first classification models. The first classification model includes a linear classification model and a deep learning classification model; at least the outputs of the plurality of first classification models are combined to obtain a combined vector. ; And based on the combined vector, a second classification model is used to determine whether the text to be recognized is junk text.
可选地,在根据本发明实施例的垃圾文本的识别方法中,基于划分结果为待识别文本生成特征向量的步骤包括:至少采用词袋模型,基于该划分结果生成待识别文本的第一向量;采用词嵌入模型,基于该划分结果生成待识别文本的第二向量。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the step of generating a feature vector for the text to be identified based on the division result includes: at least a bag-of-words model, and generating a first vector of the text to be identified based on the division result. ; Using a word embedding model, a second vector of text to be recognized is generated based on the division result.
可选地,在根据本发明实施例的垃圾文本的识别方法中,将该特征向量输入多个第一分类模型的步骤包括:将基于该划分结果生成的第一向量输入线性分类模型;将基于该划分结果生成的第二向量输入深度学习分类模型。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the step of inputting the feature vector into a plurality of first classification models includes: inputting a first vector generated based on the division result into a linear classification model; The second vector generated by the division result is input into a deep learning classification model.
可选地,在根据本发明实施例的垃圾文本的识别方法中,待识别文本为消息,该消息包括消息签名,该方法还包括步骤:基于消息签名,计算待识别文本为垃圾文本的历史概率;相应地,至少对多个第一分类模型的输出进行组合,得到组合向量的步骤包括:对历史概率和多个第一分类模型的输出进行组合,得到组合向量。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the text to be identified is a message including a message signature, and the method further includes the step of calculating a historical probability that the text to be identified is junk text based on the message signature. Correspondingly, at least combining the outputs of the plurality of first classification models to obtain a combination vector includes: combining the historical probability and the outputs of the plurality of first classification models to obtain a combination vector.
可选地,在根据本发明实施例的垃圾文本的识别方法中,基于消息签名,计算待识别文本为垃圾文本的历史概率的步骤包括:获取包括待识别文本的消息签名的历史消息;计算历史消息中确定为垃圾文本的部分历史消息与该历史消息的数量之比,以作为历史概率。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the step of calculating a historical probability that the text to be identified is junk text based on the message signature includes: obtaining a historical message including a message signature including the text to be identified; and calculating the history The ratio of the part of the historical message in the message determined as junk text to the number of the historical message is used as the historical probability.
可选地,在根据本发明实施例的垃圾文本的识别方法中,根据组合向量,采用第二分类模型来判断待识别文本是否为垃圾文本的步骤包括:将组合向量输入第二分类模型,以得到第二分类模型的输出;根据第二分类模型的输出来判断待识别文本是否为垃圾文本。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the step of using the second classification model to determine whether the text to be identified is junk text according to the combination vector includes: entering the combination vector into the second classification model, Get the output of the second classification model; determine whether the text to be recognized is junk text according to the output of the second classification model.
可选地,在根据本发明实施例的垃圾文本的识别方法中,第二分类模型包括集成学习分类模型,集成学习分类模型包括预定数目个子分类模型,将组合向量输入第二分类模型,以得到第二分类模型的输出的步骤包括:将组合向量分别输入集成学习分类模型所包含的每个子分类模型,以便采用投票机制,根据每个子分类模型的输出来确定第二分类模型的输出。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the second classification model includes an integrated learning classification model, and the integrated learning classification model includes a predetermined number of sub-classification models, and a combination vector is input to the second classification model to obtain The output of the second classification model includes: inputting the combination vector into each sub-classification model included in the integrated learning classification model separately, so as to adopt a voting mechanism to determine the output of the second classification model according to the output of each sub-classification model.
可选地,在根据本发明实施例的垃圾文本的识别方法中,第一分类模型是利用第一训练集、以特征向量为输入训练得到,第二分类模型是利用第二训练集、以组合向量为输入训练得到,第一训练集和第二训练集是从全训练集中抽样得到,全训练集包括多个标注有标签的样本,标签指示样本是否为垃圾文本。Optionally, in the method for recognizing junk text according to the embodiment of the present invention, the first classification model is obtained by training using the first training set and using feature vectors as input, and the second classification model is using the second training set by combining The vectors are obtained by input training. The first training set and the second training set are sampled from the full training set. The full training set includes multiple labeled samples, and the labels indicate whether the samples are junk text.
可选地,在根据本发明实施例的垃圾文本的识别方法中,集成学习模型所包含的子分类模型利用与子分类模型相对应的子训练集训练得到,子训练集是从第二训练集中均 匀有放回地抽样得到。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the sub-classification model included in the integrated learning model is trained using a sub-training set corresponding to the sub-classification model, and the sub-training set is obtained from the second training set Sampling evenly replaced.
可选地,在根据本发明实施例的垃圾文本的识别方法中,线性分类模型利用L1正则化来进行训练,深度学习分类模型利用丢弃机制来进行训练。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the linear classification model is trained using L1 regularization, and the deep learning classification model is trained using a discard mechanism.
可选地,在根据本发明实施例的垃圾文本的识别方法中,线性分类模型包括逻辑回归模型和/或支持向量机模型,深度学习分类模型包括卷积神经网络模型和/或循环神经网络模型。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the linear classification model includes a logistic regression model and / or a support vector machine model, and the deep learning classification model includes a convolutional neural network model and / or a recurrent neural network model .
可选地,在根据本发明实施例的垃圾文本的识别方法中,集成学习分类模型包括随机森林模型或梯度提升决策树模型。Optionally, in the method for identifying junk text according to the embodiment of the present invention, the integrated learning classification model includes a random forest model or a gradient boosting decision tree model.
Optionally, in the junk text identification method according to an embodiment of the present invention, the division result includes multiple division results. For each division result, a feature vector is generated for the text to be recognized based on that division result, and the feature vector is input into the plurality of first classification models corresponding to that division result to obtain the outputs of those first classification models, where the first classification models corresponding to a division result are trained with feature vectors generated from that division result as input. The step of combining at least the outputs of the plurality of first classification models to obtain the combination vector then includes: combining at least the outputs of the first classification models corresponding to the respective division results to obtain the combination vector.
Optionally, in the junk text identification method according to an embodiment of the present invention, the step of performing text division on the text to be recognized to obtain multiple division results includes: performing text division on the text to be recognized based on a word segmentation algorithm to obtain a division result including a plurality of segmented words; and performing text division on the text to be recognized based on an n-gram language model to obtain a division result including a plurality of n-grams.
Optionally, in the junk text identification method according to an embodiment of the present invention, the step of performing text division on the text to be recognized based on the n-gram language model to obtain a division result including a plurality of n-grams includes: performing text division on the text to be recognized based on a bigram language model to obtain a division result including a plurality of bigrams; and performing text division on the text to be recognized based on a trigram language model to obtain a division result including a plurality of trigrams.
According to another aspect of the embodiments of the present invention, a junk text identification device is provided, including: a text division unit adapted to perform text division on a text to be recognized to obtain a division result; a feature learning unit adapted to generate a feature vector for the text to be recognized based on the division result; a first classification unit adapted to input the feature vector into a plurality of first classification models to obtain outputs of the plurality of first classification models, the first classification models including a linear classification model and a deep learning classification model; a feature combination unit adapted to combine at least the outputs of the plurality of first classification models to obtain a combination vector; and a second classification unit adapted to use a second classification model to determine, according to the combination vector, whether the text to be recognized is junk text.
According to another aspect of the embodiments of the present invention, a computing device is provided, including: one or more processors; a memory; and one or more programs, where the one or more programs are stored in the memory and configured to be executed by the one or more processors, and the one or more programs include instructions for performing any of the junk text identification methods according to the embodiments of the present invention.
According to yet another aspect of the embodiments of the present invention, a readable storage medium storing a program is provided, the program including instructions that, when executed by a computing device, cause the computing device to perform any of the junk text identification methods according to the embodiments of the present invention.
The junk text identification method according to the embodiments of the present invention is based on a stacking algorithm: a second classification model integrates a plurality of first classification models to produce the classification result. By combining the strengths of multiple types of first classification models, the ability to identify junk text is greatly improved and the model performs better.
Further, a bagging (bootstrap aggregating) algorithm is combined with the stacking algorithm, and an ensemble learning classification model based on the bagging algorithm serves as the second classification model. Combining the strengths of the multiple sub-classification models included in the ensemble learning classification model further improves the ability to identify junk text while preventing overfitting of the model.
Further, by obtaining multiple division results, the error that a single division result would pass on to the classification models is compensated for, and the influence of the division results of the text to be recognized on the classification is captured at multiple granularities, further improving the ability to identify junk text.
Further, the L1 regularization term of the linear classification model ensures the sparsity of the features and thus a better junk text identification effect, while the dropout mechanism of the deep learning classification model further prevents overfitting of the model.
BRIEF DESCRIPTION OF THE DRAWINGS
To achieve the above and related objectives, certain illustrative aspects are described herein in conjunction with the following description and the accompanying drawings. These aspects indicate various ways in which the principles disclosed herein can be practiced, and all aspects and their equivalents are intended to fall within the scope of the claimed subject matter. The above and other objects, features, and advantages of the present disclosure will become more apparent by reading the following detailed description in conjunction with the accompanying drawings. Throughout this disclosure, the same reference numerals generally refer to the same parts or elements.
FIG. 1 shows a schematic diagram of a junk text recognition system 100 according to an embodiment of the present invention;
FIG. 2 shows a structural block diagram of a junk text identification device 200 according to an embodiment of the present invention;
FIG. 3 shows a structural block diagram of a computing device 300 according to an embodiment of the present invention;
FIG. 4 shows a flowchart of a junk text identification method 400 according to an embodiment of the present invention; and
FIG. 5 shows a structural block diagram of a junk text identification device 500 according to an embodiment of the present invention.
DETAILED DESCRIPTION
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although the accompanying drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure can be implemented in various forms and should not be limited by the embodiments set forth herein. On the contrary, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
FIG. 1 shows a schematic diagram of a junk text recognition system 100 according to an embodiment of the present invention. As shown in FIG. 1, the junk text recognition system 100 includes a recognition front end 110 and a junk text identification device 200. Junk text here refers to abnormal text, that is, text containing spam. A typical piece of junk text is text containing pornographic information. The following is an example of junk text:
"是苏静吗?波多野结衣携手四位天后级名优重出江湖,为您提供视频观看地址http://tb.cn/GIQZkPw" (roughly: "Is this Su Jing? Yui Hatano and four diva-level stars are making a comeback; watch the video at http://tb.cn/GIQZkPw").
The recognition front end 110 is any requester that needs to determine whether a text to be recognized is junk text. For example, in one approach, the recognition front end 110 may be part of an instant messaging system. The instant messaging system can receive messages entered by users. If a message is junk text, the instant messaging system needs to intercept it; if the message is not junk text, the instant messaging system can deliver it. The recognition front end 110 therefore sends the message to the junk text identification device 200 to determine whether the message is junk text.
The recognition front end 110 may also be part of a comment review system. The comment review system can receive comments entered by users. If a comment is junk text, the comment review system needs to intercept it; if the comment is not junk text, the comment review system can publish it. In this case, the recognition front end 110 may send a junk text identification request including the comment to the junk text identification device 200 for processing.
The present invention is not limited to a specific form of the recognition front end 110. The junk text identification device 200 may also receive requests from the recognition front end 110 in various ways. For example, the junk text identification device 200 may provide an application programming interface (API) with a predefined format, so that the recognition front end 110 can organize a junk text identification request according to that definition and send it to the junk text identification device 200.
The junk text identification device 200 receives the request, obtains the text to be recognized from the request, and determines whether the text to be recognized is junk text.
FIG. 2 shows a structural block diagram of a junk text identification device 200 according to an embodiment of the present invention. As shown in FIG. 2, the junk text identification device 200 includes a text division unit 210, a feature learning unit 220, a first classification unit 230, a feature combination unit 240, and a second classification unit 250.
The text division unit 210 is adapted to perform text division on the text to be recognized to obtain a division result. The feature learning unit 220 is adapted to generate a feature vector for the text to be recognized based on the division result. The first classification unit 230 is adapted to input the feature vector into a plurality of first classification models to obtain the outputs of these first classification models; the first classification models may include a linear classification model and a deep learning classification model. The feature combination unit 240 is adapted to combine at least the outputs of the plurality of first classification models to obtain a combination vector. The second classification unit 250 is adapted to use a second classification model to determine, according to the combination vector, whether the text to be recognized is junk text.
Both the first classification unit 230 and the second classification unit 250 may include a plurality of processing modules. The processing modules included in the first classification unit 230 may implement the intended plurality of first classification models, and the processing modules included in the second classification unit 250 may implement the intended second classification model. The outputs of the first classification models and the second classification model may each indicate whether the text to be recognized is junk text, or the probability that the text to be recognized is junk text.
The first classification models and the second classification model contain a large number of computational parameters, and these parameters need to be adjusted through training in order to achieve the best classification effect in actual use. Therefore, each processing module in the first classification unit 230 and the second classification unit 250 includes a large number of computational parameters awaiting training. As shown in FIG. 1, the junk text recognition system 100 further includes a full training set 130. The samples in the full training set 130 can be used to train the first classification models and the second classification model. The full training set 130 includes a plurality of labeled samples, each label indicating whether the sample is junk text. The samples usually include positive samples and negative samples: the label of a positive sample indicates that the sample is junk text, while the label of a negative sample indicates that the sample is not junk text. The full training set 130 needs to include a predetermined ratio of positive samples to negative samples, so that the first classification models and the second classification model can be trained more comprehensively.
Specifically, the first classification models are trained on a first training set with the above feature vectors as input, and the second classification model is trained on a second training set with the above combination vectors as input; both the first training set and the second training set are sampled from the full training set. The first training set is not equal to the second training set. That is, the first training set and the second training set may intersect but may not be identical. Of course, the first training set and the second training set may also have no intersection at all.
To ensure the sparsity of the features, the linear classification model may be trained with L1 regularization, that is, the loss function of the linear classification model may include an L1 regularization term (the L1 norm). To prevent overfitting of the model, the deep learning classification model may be trained with a dropout mechanism, that is, the deep learning classification model may discard a portion of its neurons according to a predetermined dropout ratio during training.
To further avoid overfitting, the second classification model may include an ensemble learning classification model, namely an ensemble learning classification model based on the bootstrap aggregating (bagging) algorithm.
The ensemble learning classification model includes a predetermined number of sub-classification models. A predetermined number of equally sized sub-training sets can be sampled from the second training set uniformly and with replacement (that is, using bootstrap sampling) to train the predetermined number of sub-classification models. These sub-training sets correspond one-to-one with the sub-classification models; that is, each sub-classification model is trained on the sub-training set corresponding to it.
After each sub-classification model has been trained, the second classification unit 250 may input the above combination vector into each sub-classification model included in the ensemble learning classification model, determine the output of the second classification model from the outputs of the individual sub-classification models by a voting mechanism or by averaging, and then determine, according to the output of the second classification model, whether the text to be recognized is junk text.
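For illustration only, the following Python sketch shows bootstrap sampling and majority voting under the assumption that the sub-classification models are scikit-learn decision trees; the patent does not prescribe a particular implementation, and the function names are assumptions.

```python
# Minimal sketch: bootstrap-sampled sub-training sets drawn from the second
# training set, plus majority voting over the sub-classification models.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_sub_models(X2, y2, n_models=100, seed=0):
    """X2, y2: NumPy arrays of combination vectors and labels (second training set)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X2), size=len(X2))  # uniform, with replacement
        models.append(DecisionTreeClassifier().fit(X2[idx], y2[idx]))
    return models

def predict_by_vote(models, combination_vector):
    votes = [int(m.predict(combination_vector.reshape(1, -1))[0]) for m in models]
    return int(np.bincount(votes).argmax())  # majority vote: 1 = junk text
```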
According to one embodiment of the present invention, the above division result may include multiple division results. For each division result, a plurality of first classification models corresponding to that division result can be trained with feature vectors generated from that division result as input.
Correspondingly, for each division result, the feature learning unit 220 may generate a feature vector for the text to be recognized based on that division result, and the first classification unit 230 may input the feature vector based on that division result into the plurality of first classification models corresponding to that division result to obtain their outputs. The step of combining at least the outputs of the plurality of first classification models to obtain a combination vector may then include: combining at least the outputs of the first classification models corresponding to the respective division results to obtain the combination vector.
According to another embodiment of the present invention, when the text to be recognized is a message including a message signature, the junk text identification device 200 may further include a historical probability calculation unit 260 (not shown in FIG. 2). The historical probability calculation unit 260 is adapted to calculate, based on the message signature, the historical probability that the text to be recognized is junk text. Correspondingly, the historical probability and the outputs of all the first classification models can be combined to obtain the above combination vector.
Here, a message refers to text sent from one party (the message sender) to another party (the message receiver), and it includes a message signature. The message signature uniquely identifies the message sender, and can usually be a company name, brand name, project name, application name, and so on. The message signature is generally located at the beginning of the message and is set off from the rest of the content by delimiters such as "【】". The following is an example of a message: "【XX外卖】您的外卖已送达。" ("[XX Takeaway] Your takeaway has been delivered."), in which "XX外卖" ("XX Takeaway") is the message signature.
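As a small illustrative sketch, signature extraction could look like the following; the delimiter convention follows the example above, and the regular expression and function name are assumptions rather than part of the patent.

```python
import re

# Signature assumed to sit at the start of the message between 【 and 】.
SIGNATURE_RE = re.compile(r'^【([^】]+)】')

def extract_signature(message: str):
    match = SIGNATURE_RE.match(message)
    return match.group(1) if match else None

print(extract_signature("【XX外卖】您的外卖已送达。"))  # -> XX外卖
```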
In the following, the specific structures of the devices and units mentioned above, together with the corresponding processing methods, will be described with reference to the accompanying drawings.
According to the embodiments of the present invention, the various components in the above junk text recognition system 100, such as the various units and devices, may each be implemented by a computing device 300 as described below. FIG. 3 shows a schematic diagram of a computing device 300 according to an embodiment of the present invention.
As shown in FIG. 3, in a basic configuration 302, the computing device 300 typically includes a system memory 306 and one or more processors 304. A memory bus 308 may be used for communication between the processors 304 and the system memory 306.
Depending on the desired configuration, the processor 304 may be any type of processor, including but not limited to a microprocessor (μP), a microcontroller (μC), a digital signal processor (DSP), or any combination thereof. The processor 304 may include one or more levels of cache, such as a level-1 cache 310 and a level-2 cache 312, a processor core 314, and registers 316. An example processor core 314 may include an arithmetic logic unit (ALU), a floating-point unit (FPU), a digital signal processing core (DSP core), or any combination thereof. An example memory controller 318 may be used with the processor 304, or in some implementations the memory controller 318 may be an internal part of the processor 304.
Depending on the desired configuration, the system memory 306 may be any type of memory, including but not limited to volatile memory (such as RAM), non-volatile memory (such as ROM or flash memory), or any combination thereof. The system memory 306 may include an operating system 320, one or more applications 322, and program data 324. In some embodiments, the applications 322 may be arranged to be executed on the operating system by the one or more processors 304 using the program data 324.
Depending on the desired configuration, the computing device 300 may further include a storage device 332 and a storage interface bus 334, and the storage device 332 may include removable storage 336 and non-removable storage 338.
The computing device 300 may also include an interface bus 340 that facilitates communication from various interface devices (for example, output devices 342, peripheral interfaces 344, and communication devices 346) to the basic configuration 302 via a bus/interface controller 330. Example output devices 342 include a graphics processing unit 348 and an audio processing unit 350, which may be configured to facilitate communication with various external devices such as a display or speakers via one or more A/V ports 352. Example peripheral interfaces 344 may include a serial interface controller 354 and a parallel interface controller 356, which may be configured to facilitate communication, via one or more I/O ports 358, with external devices such as input devices (for example, a keyboard, mouse, pen, voice input device, or touch input device) or other peripherals (for example, a printer or scanner). An example communication device 346 may include a network controller 360, which may be arranged to facilitate communication with one or more other computing devices 362 over a network communication link via one or more communication ports 364.
A network communication link is one example of a communication medium. A communication medium may typically be embodied as computer-readable instructions, data structures, or program modules in a modulated data signal such as a carrier wave or other transmission mechanism, and may include any information delivery medium. A "modulated data signal" is a signal in which one or more of its characteristics are set or changed in such a manner as to encode information in the signal. As non-limiting examples, communication media may include wired media such as a wired network or a dedicated-line network, and various wireless media such as acoustic, radio frequency (RF), microwave, infrared (IR), or other wireless media. The term computer-readable medium as used here may include both storage media and communication media.
The computing device 300 may be implemented as a server, such as a database server, an application server, or a web server, or as a personal computer including desktop and notebook configurations. Of course, the computing device 300 may also be implemented as part of a small-sized portable (or mobile) electronic device.
In an embodiment according to the present invention, the computing device 300 is implemented as the junk text identification device 200 and is configured to perform the junk text identification method 400 according to the embodiments of the present invention. The applications 322 of the computing device 300 contain a plurality of program instructions for executing the junk text identification method 400 according to the embodiments of the present invention, and the program data 324 may further store configuration information of the junk text identification device 200 and the like.
FIG. 4 shows a flowchart of a junk text identification method 400 according to an embodiment of the present invention. As shown in FIG. 4, the junk text identification method 400 starts at step S410.
In step S410, text division is performed on the text to be recognized to obtain a division result.
Any text division method in the art can be used to divide the text to be recognized. For example, a word segmentation algorithm can be used to divide the text to be recognized, yielding a division result that includes a plurality of segmented words. The present invention does not limit the specific word segmentation algorithm. The following is an example:
Using a word segmentation algorithm to divide the text "今天天气很好" ("The weather is very good today") yields a division result consisting of the four segmented words "今天" (today), "天气" (weather), "很" (very), and "好" (good).
For another example, text division may be performed on the text to be recognized based on an n-gram language model, yielding a division result that includes a plurality of n-grams. In the embodiments of the present invention, n usually takes the value 2 or 3.
The n-gram language model is a probabilistic language model based on an (n-1)-order Markov chain, which infers the structure of a sentence from the probabilities with which sequences of n items occur. An n-gram is a contiguous sequence of n items from a given text; the items can be phonemes, syllables, letters, words, or characters. In the embodiments of the present invention, an n-gram is a contiguous sequence of n characters from the text to be recognized. The following are two examples:
Dividing "今天天气很好" with a bigram language model yields a division result consisting of the five bigrams "今天", "天天", "天气", "气很", and "很好".
Dividing "今天天气很好" with a trigram language model yields a division result consisting of the four trigrams "今天天", "天天气", "天气很", and "气很好".
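For illustration, a Python sketch of the three kinds of division results is given below; the jieba segmenter is an assumed stand-in, since the patent does not name a specific word segmentation algorithm.

```python
import jieba  # assumed segmenter; any word segmentation algorithm would do

def char_ngrams(text, n):
    """Contiguous character n-grams of the text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

text = "今天天气很好"
words = list(jieba.cut(text))    # e.g. ['今天', '天气', '很', '好']
bigrams = char_ngrams(text, 2)   # ['今天', '天天', '天气', '气很', '很好']
trigrams = char_ngrams(text, 3)  # ['今天天', '天天气', '天气很', '气很好']
```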
Then, in step S420, a feature vector is generated for the text to be recognized based on the division result. Specifically, the feature vector may include a first vector and a second vector.
At least a bag-of-words model can be used to generate the first vector of the text to be recognized based on the division result. In particular, the bag-of-words model may first be used to generate a bag-of-words vector of the text to be recognized from the division result, and then a feature extraction method such as the term frequency-inverse document frequency (TF-IDF) algorithm or a mutual information algorithm may be applied to the bag-of-words vector to obtain the first vector.
A word embedding model can be used to generate the second vector of the text to be recognized, that is, a word vector, based on the division result. The present invention does not limit the specific word embedding model; for example, the Skip-Gram model or the CBOW (continuous bag of words) model may be used.
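A minimal sketch of the two feature vectors follows, assuming scikit-learn's TfidfVectorizer for the bag-of-words/TF-IDF path and gensim's Word2Vec in skip-gram mode for the embedding path; both library choices, the toy corpus, and averaging the word vectors are assumptions, not details fixed by the patent.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.models import Word2Vec

corpus = [["今天", "天气", "很", "好"],      # toy pre-divided texts
          ["视频", "观看", "地址"]]

# First vector: bag of words weighted by TF-IDF.
tfidf = TfidfVectorizer(analyzer=lambda tokens: tokens)  # input is pre-split
first_vectors = tfidf.fit_transform(corpus)

# Second vector: average of skip-gram word vectors (sg=1 selects Skip-Gram).
w2v = Word2Vec(sentences=corpus, vector_size=100, sg=1, min_count=1)

def second_vector(tokens):
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)
```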
Then, in step S430, the above feature vectors are input into a plurality of first classification models to obtain the outputs of the plurality of first classification models.
Since the first classification models may include two types of models, linear classification models and deep learning classification models, the first vector generated from the above division result can be input into the linear classification models, and the second vector generated from the above division result can be input into the deep learning classification models.
After the outputs of the plurality of first classification models are obtained, in step S440, at least the outputs of the plurality of first classification models are combined to obtain a combination vector. Simply put, at least the outputs of the plurality of first classification models can be concatenated into the combination vector.
According to one embodiment of the present invention, the text to be recognized may be a message as introduced above, which includes a message signature. The junk text identification method 400 may therefore further include the step of calculating, based on the message signature, the historical probability that the text to be recognized is junk text. Specifically, historical messages that include the message signature of the text to be recognized can be obtained (for example, from a preset historical message database), and the ratio of the number of those historical messages determined to be junk text to the total number of obtained historical messages is computed as the historical probability that the text to be recognized is junk text.
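As an illustrative sketch of this computation (the record layout of the historical message database and the example signature are assumptions):

```python
def historical_probability(signature, history):
    """history: iterable of (signature, is_junk) pairs for past messages."""
    matches = [is_junk for sig, is_junk in history if sig == signature]
    if not matches:
        return 0.0  # no history for this signature
    return sum(matches) / len(matches)  # share of past messages judged junk

# Hypothetical example: two of three past messages with this signature were junk.
print(historical_probability("XX贷款", [("XX贷款", 1), ("XX贷款", 1), ("XX贷款", 0)]))
```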
Correspondingly, the historical probability that the text to be recognized is junk text can be combined with the outputs of all the first classification models to obtain the above combination vector. In this way, by introducing the historical probability calculated from the message signature, the risk that the text to be recognized is junk text can be assessed from multiple dimensions.
Finally, in step S450, the second classification model can be used to determine, according to the combination vector, whether the text to be recognized is junk text. That is, the combination vector can be input into the second classification model to obtain the output of the second classification model, and whether the text to be recognized is junk text is then determined according to the output of the second classification model.
According to one embodiment of the present invention, the first classification models are trained on a first training set with the above feature vectors as input, and the second classification model is trained on a second training set with the above combination vectors as input. Both the first training set and the second training set are sampled from the full training set. The full training set includes a plurality of labeled samples, each label indicating whether the sample is junk text. The samples usually include positive samples and negative samples: the label of a positive sample indicates that the sample is junk text, while the label of a negative sample indicates that the sample is not junk text. The full training set 130 needs to include a predetermined ratio of positive samples to negative samples, so that the first classification models and the second classification model can be trained more comprehensively.
The first training set is not equal to the second training set. That is, the first training set and the second training set may intersect but may not be identical. Of course, the first training set and the second training set may also have no intersection at all.
On the one hand, the linear classification model may be trained with L1 regularization to ensure the sparsity of the features; that is, the loss function of the linear classification model may include an L1 regularization term (the L1 norm). In one embodiment, the linear classification model may include a logistic regression model and/or a support vector machine (SVM) model, and the logistic regression model may be trained with L1 regularization.
The present invention does not limit the specific linear classification model; besides the logistic regression model and the support vector machine model, other linear classification models may also be used.
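As an illustration, a minimal scikit-learn sketch of L1-regularized linear models follows; the library choice and the regularization strengths are assumptions, and the training data names are hypothetical.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# penalty='l1' adds the L1 norm to the loss so that uninformative features
# receive zero weights (sparsity). C is the inverse regularization strength:
# a smaller C means a larger L1 penalty.
lr = LogisticRegression(penalty='l1', solver='liblinear', C=1.0)
svm = LinearSVC(penalty='l1', loss='squared_hinge', dual=False, C=1.0)
# lr.fit(first_vectors, labels); svm.fit(first_vectors, labels)
```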
On the other hand, the deep learning classification model may be trained with a dropout mechanism to prevent overfitting; that is, the deep learning classification model may discard a portion of its neurons according to a predetermined dropout ratio during training. In one embodiment, the deep learning classification model may include a convolutional neural network (CNN) model and/or a recurrent neural network (RNN) model.
The present invention does not limit the specific deep learning classification model; besides the convolutional neural network model and the recurrent neural network model, other deep learning classification models may also be used.
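A minimal TextCNN sketch with dropout follows, assuming PyTorch; the layer sizes and the 0.5 dropout ratio are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, vocab_size, embed_dim=100, n_filters=128, dropout=0.5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.conv = nn.Conv1d(embed_dim, n_filters, kernel_size=3, padding=1)
        self.dropout = nn.Dropout(dropout)  # drops neurons at the given ratio in training
        self.fc = nn.Linear(n_filters, 2)   # two classes: junk / not junk

    def forward(self, token_ids):                       # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)       # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x)).max(dim=2).values  # max-over-time pooling
        return self.fc(self.dropout(x))
```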
In addition, to further improve the accuracy of the classification result and prevent overfitting, the second classification model may include an ensemble learning classification model, specifically an ensemble learning classification model based on the bootstrap aggregating algorithm. The ensemble learning classification model includes a predetermined number of sub-classification models; the above combination vector can be input into each sub-classification model included in the ensemble learning classification model, and the output of the second classification model is determined from the outputs of the individual sub-classification models by a voting mechanism or by averaging.
Each sub-classification model included in the ensemble learning model is trained on the sub-training set corresponding to that sub-classification model, and the sub-training sets are obtained by sampling the second training set uniformly and with replacement. Specifically, a predetermined number of equally sized sub-training sets can be sampled from the second training set uniformly and with replacement (that is, using bootstrap sampling) to train the predetermined number of sub-classification models, with the sub-training sets corresponding one-to-one to the sub-classification models.
The present invention does not limit the specific ensemble learning model. The ensemble learning classification model may be, for example, a random forest model or a gradient boosting decision tree (GBDT) model, in which the sub-classification models are decision trees. The predetermined number can usually take the value 100.
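A minimal sketch of this second classification model, assuming scikit-learn: n_estimators=100 matches the predetermined number of 100 sub-models, and bootstrap=True gives the uniform sampling with replacement described above; the variable names are hypothetical.

```python
from sklearn.ensemble import RandomForestClassifier

second_model = RandomForestClassifier(n_estimators=100, bootstrap=True)
# second_model.fit(combination_vectors, labels)  # trained on the second training set
# is_junk = second_model.predict(combination_vector.reshape(1, -1))[0]
```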
In addition, according to one embodiment of the present invention, the above division result includes multiple division results. Specifically, text division can be performed on the text to be recognized based on a word segmentation algorithm to obtain a division result including a plurality of segmented words, and text division can be performed on the text to be recognized based on an n-gram language model to obtain a division result including a plurality of n-grams.
The division result including a plurality of n-grams may comprise a division result including a plurality of bigrams and a division result including a plurality of trigrams, obtained by dividing the text to be recognized based on a bigram language model and a trigram language model, respectively.
Obviously, for each division result, a plurality of first classification models corresponding to that division result need to be trained with feature vectors generated from that division result as input.
For each division result, a feature vector can be generated for the text to be recognized based on that division result, and the feature vector generated from that division result is input into the plurality of first classification models corresponding to that division result to obtain their outputs. In this case, the step of combining at least the outputs of the plurality of first classification models to obtain the combination vector is:
combining at least the outputs of the first classification models corresponding to the respective division results to obtain the combination vector.
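For illustration, a sketch of assembling the combination vector from the per-division-result first classification models; the predict_proba-style interface and all variable names are assumptions.

```python
import numpy as np

def build_combination_vector(features_by_division, models_by_division):
    """features_by_division: {name: (1, d) feature array for that division result};
    models_by_division: {name: list of trained first classification models}."""
    parts = []
    for name, models in models_by_division.items():
        x = features_by_division[name]
        parts.extend(float(m.predict_proba(x)[0, 1]) for m in models)  # P(junk)
    return np.array(parts)  # input to the second classification model
```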
According to one embodiment of the present invention, different division results correspond to different L1 regularization terms. A linear classification model corresponding to the division result including a plurality of segmented words usually has high precision and low recall, so the L1 regularization term to be added is small. A linear classification model corresponding to the division result including a plurality of bigrams usually has low precision and high recall, so the L1 regularization term to be added is large. A linear classification model corresponding to the division result including a plurality of trigrams has moderate precision and recall, but the feature vectors generated from the trigram division result are of very high dimensionality, so a larger L1 regularization term also needs to be added. In the embodiments of the present invention, the L1 regularization term corresponding to the division result including a plurality of segmented words is the smallest, the one corresponding to the division result including a plurality of trigrams is in the middle, and the one corresponding to the division result including a plurality of bigrams is the largest.
According to another embodiment of the present invention, different division results correspond to different dropout ratios. Similarly to the L1 regularization terms, in the embodiments of the present invention the dropout ratio corresponding to the division result including a plurality of segmented words is the smallest, the one corresponding to the division result including a plurality of trigrams is in the middle, and the one corresponding to the division result including a plurality of bigrams is the largest.
FIG. 5 shows a structural block diagram of a junk text identification device 500 according to an embodiment of the present invention. As shown in FIG. 5, the junk text identification device 500 may include a first text division unit 510, a second text division unit 512, a third text division unit 514, a first feature learning unit 520, a second feature learning unit 522, a first basic classification unit 530, a second basic classification unit 532, a third basic classification unit 534, an output combination unit 540, and an integrated classification unit 550.
The first text division unit 510 is adapted to perform text division on the text to be recognized based on a word segmentation algorithm to obtain a first division result including a plurality of segmented words. The second text division unit 512 is adapted to perform text division on the text to be recognized based on a bigram language model to obtain a second division result including a plurality of bigrams. The third text division unit 514 is adapted to perform text division on the text to be recognized based on a trigram language model to obtain a third division result including a plurality of trigrams.
The first feature learning unit 520 is connected to the first text division unit 510, the second text division unit 512, and the third text division unit 514, and is adapted to use at least a bag-of-words model to generate a first vector for the text to be recognized based on the first division result, a first vector based on the second division result, and a first vector based on the third division result.
The second feature learning unit 522 is connected to the first text division unit 510, the second text division unit 512, and the third text division unit 514, and is adapted to use a word embedding model to generate a second vector for the text to be recognized based on the first division result, a second vector based on the second division result, and a second vector based on the third division result.
The first basic classification unit 530 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the first division result into a first logistic regression model and a first support vector machine model corresponding to the first division result, and to input the second vector generated from the first division result into a first convolutional neural network model corresponding to the first division result, so as to obtain the outputs of the first logistic regression model, the first support vector machine model, and the first convolutional neural network model.
The second basic classification unit 532 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the second division result into a second logistic regression model and a second support vector machine model corresponding to the second division result, and to input the second vector generated from the second division result into a second convolutional neural network model corresponding to the second division result, so as to obtain the outputs of the second logistic regression model, the second support vector machine model, and the second convolutional neural network model.
The third basic classification unit 534 is connected to the first feature learning unit 520 and the second feature learning unit 522, and is adapted to input the first vector generated from the third division result into a third logistic regression model and a third support vector machine model corresponding to the third division result, and to input the second vector generated from the third division result into a third convolutional neural network model corresponding to the third division result, so as to obtain the outputs of the third logistic regression model, the third support vector machine model, and the third convolutional neural network model.
The output combination unit 540 is connected to the first basic classification unit 530, the second basic classification unit 532, and the third basic classification unit 534, and is adapted to combine the outputs of the first logistic regression model, the first support vector machine model, the first convolutional neural network model, the second logistic regression model, the second support vector machine model, the second convolutional neural network model, the third logistic regression model, the third support vector machine model, and the third convolutional neural network model into a combination vector.
According to one embodiment of the present invention, when the text to be recognized is a message including a message signature, the output combination unit 540 may combine the historical probability that the text to be recognized is junk text with the outputs of the first logistic regression model, the first support vector machine model, the first convolutional neural network model, the second logistic regression model, the second support vector machine model, the second convolutional neural network model, the third logistic regression model, the third support vector machine model, and the third convolutional neural network model to obtain the combination vector.
The integrated classification unit 550 is connected to the output combination unit 540 and is adapted to input the combination vector into each decision tree included in a random forest model, so that a voting mechanism is used to determine the output of the random forest model from the outputs of the individual decision trees. Finally, whether the text to be recognized is junk text is determined according to the output of the random forest model.
The corresponding processing in each unit of the junk text identification device 500 has already been explained in detail in the above description of the junk text identification method 400 with reference to FIG. 1 to FIG. 4, so the repeated content will not be described again here.
In summary, the junk text identification method according to the embodiments of the present invention is based on a stacking algorithm: a second classification model integrates a plurality of first classification models to produce the classification result. By combining the strengths of multiple types of first classification models, the ability to identify junk text is greatly improved and the model performs better.
Further, a bagging algorithm is combined on the basis of the stacking algorithm, with an ensemble learning classification model based on the bagging algorithm as the second classification model, to further improve the ability to identify junk text while preventing overfitting of the model.
Further, by obtaining multiple division results, the error that a single division result would pass on to the classification models is compensated for, and the influence of the text features of the text to be recognized on the classification is captured at multiple granularities, further improving the ability to identify junk text.
Further, the L1 regularization term of the linear classification model ensures the sparsity of the features and thus a better junk text identification effect, and the dropout mechanism of the deep learning classification model further prevents overfitting of the model.
Taking the identification of messages containing pornographic information as an example, the main difficulties in identifying them are as follows:
The proportion of messages containing pornographic information relative to normal messages is extremely low, usually lower than 1:10,000. The variance is large: the types of messages covered are very broad, and the forms taken by messages containing pornographic information are extremely complex. In addition, such messages have many variants and are highly obscure. Because of these difficulties, the identification ability of traditional junk text identification schemes is limited. For example, using only the aforementioned first division result, first vector, and support vector machine model, the F1 value of the model is 0.954. Using only the aforementioned second division result, first vector, and support vector machine model, the F1 value of the model is 0.961. Using only the aforementioned first division result, second vector, and convolutional neural network model, the F1 value of the model is 0.971. With the junk text identification scheme according to the embodiment of the present invention shown in FIG. 5, the F1 value of the model is 0.987. The F1 value is the F-score, the harmonic mean of the precision and recall of the model; generally, the larger the F1 value, the better the performance of the model.
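For reference, with precision P and recall R, the harmonic mean takes the standard form F1 = 2·P·R / (P + R), so a high F1 requires both quantities to be high at once: with F1 = 0.987, even if P were 1, R would still have to be at least about 0.974.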
Obviously, with the junk text identification scheme according to the embodiments of the present invention, the model performs better and identification is more accurate.
It should be understood that the various techniques described herein may be implemented in conjunction with hardware or software, or a combination thereof. Thus, the method and apparatus of the present invention, or certain aspects or portions thereof, may take the form of program code (i.e., instructions) embodied in a tangible medium, such as a floppy disk, CD-ROM, hard drive, or any other machine-readable storage medium, wherein when the program is loaded into and executed by a machine such as a computer, the machine becomes an apparatus for practicing the invention.
Where the program code is executed on a programmable computer, the computing device generally includes a processor, a storage medium readable by the processor (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device. The memory is configured to store the program code; the processor is configured to execute the various methods of the present invention according to the instructions in the program code stored in the memory.
By way of example, and not limitation, computer-readable media comprise computer storage media and communication media. Computer storage media store information such as computer-readable instructions, data structures, program modules, or other data. Communication media typically embody computer-readable instructions, data structures, program modules, or other data in a modulated data signal such as a carrier wave or other transport mechanism, and include any information delivery media. Combinations of any of the above are also included within the scope of computer-readable media.
It should be appreciated that, in the foregoing description of exemplary embodiments of the invention, various features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. However, the disclosed method is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into that detailed description, with each claim standing on its own as a separate embodiment of this invention.
Those skilled in the art should understand that the modules, units, or components of the devices in the examples disclosed herein may be arranged in a device as described in the embodiments, or alternatively may be located in one or more devices different from the devices in the examples. The modules in the foregoing examples may be combined into one module or may be further divided into multiple sub-modules.
Those skilled in the art will appreciate that the modules in a device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by alternative features serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Furthermore, some of the embodiments are described herein as methods, or as combinations of method elements, that can be implemented by a processor of a computer system or by other means of carrying out the described functions. Thus, a processor having the necessary instructions for implementing such a method or method element forms a means for implementing the method or method element. Furthermore, an element described herein of an apparatus embodiment is an example of a means for carrying out the function performed by that element for the purpose of carrying out the invention.
As used herein, unless otherwise specified, the use of the ordinal adjectives "first", "second", "third", etc., to describe a common object merely indicates that different instances of like objects are being referred to, and is not intended to imply that the objects so described must be in a given sequence, whether temporally, spatially, in ranking, or in any other manner.
Although the invention has been described with respect to a limited number of embodiments, those skilled in the art, having the benefit of the above description, will appreciate that other embodiments can be devised within the scope of the invention as described herein. Moreover, it should be noted that the language used in this specification has been selected principally for readability and instructional purposes, and not to delineate or circumscribe the inventive subject matter. Accordingly, many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the appended claims. With respect to the scope of the invention, the present disclosure is illustrative rather than restrictive, and the scope of the invention is defined by the appended claims.

Claims (18)

  1. A method for identifying junk text, the method comprising the steps of:
    performing text division on a text to be identified to obtain a division result;
    generating a feature vector for the text to be identified based on the division result;
    inputting the feature vector into a plurality of first classification models to obtain outputs of the plurality of first classification models, the first classification models comprising a linear classification model and a deep learning classification model;
    combining at least the outputs of the plurality of first classification models to obtain a combination vector; and
    determining, based on the combination vector and using a second classification model, whether the text to be identified is junk text.
  2. The method of claim 1, wherein the step of generating a feature vector for the text to be identified based on the division result comprises:
    generating, using at least a bag-of-words model, a first vector of the text to be identified based on the division result;
    generating, using a word embedding model, a second vector of the text to be identified based on the division result.
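By way of illustration only, a minimal sketch of these two feature representations, with a hypothetical token list and a random embedding table standing in for a trained word2vec/GloVe table:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

tokens = ["free", "prize", "click", "link"]  # a division result (hypothetical)

# First vector: bag-of-words counts over a fixed (hypothetical) vocabulary.
vectorizer = CountVectorizer(vocabulary=["free", "prize", "click", "link", "hello"])
first_vector = vectorizer.transform([" ".join(tokens)]).toarray()[0]

# Second vector: average of per-token word embeddings; random stand-ins
# here, where a trained embedding table would be used in practice.
rng = np.random.default_rng(0)
embedding_table = {t: rng.normal(size=8) for t in tokens}
second_vector = np.mean([embedding_table[t] for t in tokens], axis=0)

print(first_vector, second_vector.shape)
```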
  3. The method of claim 2, wherein the step of inputting the feature vector into a plurality of first classification models comprises:
    inputting the first vector generated based on the division result into the linear classification model;
    inputting the second vector generated based on the division result into the deep learning classification model.
  4. The method of claim 1, wherein the text to be identified is a message, the message comprises a message signature, and the method further comprises the step of:
    calculating, based on the message signature, a historical probability that the text to be identified is junk text;
    correspondingly, the step of combining at least the outputs of the plurality of first classification models to obtain a combination vector comprises:
    combining the historical probability and the outputs of the plurality of first classification models to obtain the combination vector.
  5. The method of claim 4, wherein the step of calculating, based on the message signature, a historical probability that the text to be identified is junk text comprises:
    obtaining historical messages that include the message signature of the text to be identified;
    calculating the ratio of the number of historical messages determined to be junk text to the total number of the historical messages, as the historical probability.
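Illustratively, the computation of claim 5 might look like the following sketch; the message record layout and the default value for unseen signatures are assumptions:

```python
def signature_history_probability(history: list, signature: str) -> float:
    """Share of past messages carrying `signature` that were junk."""
    relevant = [m for m in history if m["signature"] == signature]
    if not relevant:
        return 0.0  # default for unseen signatures is an assumption
    junk = sum(1 for m in relevant if m["is_junk"])
    return junk / len(relevant)


history = [
    {"signature": "[XX Bank]", "is_junk": False},
    {"signature": "[XX Bank]", "is_junk": True},
]
print(signature_history_probability(history, "[XX Bank]"))  # 0.5
```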
  6. The method of claim 1 or 4, wherein the step of determining, based on the combination vector and using a second classification model, whether the text to be identified is junk text comprises:
    inputting the combination vector into the second classification model to obtain an output of the second classification model;
    determining, according to the output of the second classification model, whether the text to be identified is junk text.
  7. The method of claim 6, wherein the second classification model comprises an ensemble learning classification model, the ensemble learning classification model comprises a predetermined number of sub-classification models, and the step of inputting the combination vector into the second classification model to obtain an output of the second classification model comprises:
    inputting the combination vector into each sub-classification model included in the ensemble learning classification model, so that a voting mechanism is used to determine the output of the second classification model according to the output of each sub-classification model.
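A minimal sketch of such a voting mechanism; simple majority voting with insertion-order tie-breaking is one possible choice, as the claim does not fix these details:

```python
from collections import Counter


def majority_vote(sub_model_outputs: list) -> str:
    """Combine sub-classifier labels by simple majority vote."""
    return Counter(sub_model_outputs).most_common(1)[0][0]


print(majority_vote(["junk", "junk", "normal"]))  # 'junk'
```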
  8. The method of claim 1 or 7, wherein the first classification models are trained using a first training set with feature vectors as input, the second classification model is trained using a second training set with combination vectors as input, the first training set and the second training set are sampled from a full training set, and the full training set comprises a plurality of labeled samples, each label indicating whether the corresponding sample is junk text.
  9. The method of claim 8, wherein each sub-classification model included in the second classification model is trained using a sub-training set corresponding to that sub-classification model, the sub-training set being sampled uniformly with replacement from the second training set.
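Uniform sampling with replacement as recited in claim 9 can be sketched as follows (illustrative only; the sub-training-set size matching the source set is an assumption):

```python
import random


def bootstrap_sample(training_set: list, seed: int = 0) -> list:
    """Uniform sampling with replacement: each sub-training set has the
    same size as the source set; on average about 36.8% of examples are
    left out of any given draw."""
    rng = random.Random(seed)
    n = len(training_set)
    return [training_set[rng.randrange(n)] for _ in range(n)]


print(bootstrap_sample(list(range(10))))
```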
  10. The method of claim 1, wherein the linear classification model is trained using L1 regularization, and the deep learning classification model is trained using a dropout mechanism.
  11. The method of claim 1, wherein the linear classification model comprises a logistic regression model and/or a support vector machine model, and the deep learning classification model comprises a convolutional neural network model and/or a recurrent neural network model.
  12. The method of claim 7, wherein the ensemble learning classification model comprises a random forest model or a gradient boosted decision tree model.
  13. The method of any one of claims 1-12, wherein the division result comprises a plurality of division results; for each division result, a feature vector is generated for the text to be identified based on that division result, and the feature vector is input into a plurality of first classification models corresponding to that division result, so as to obtain outputs of the plurality of first classification models corresponding to that division result, the plurality of first classification models corresponding to that division result being trained with feature vectors generated based on that division result as input;
    the step of combining at least the outputs of the plurality of first classification models to obtain a combination vector comprises:
    combining at least the outputs of the first classification models corresponding to each division result to obtain the combination vector.
  14. The method of claim 13, wherein the step of performing text division on the text to be identified to obtain a plurality of division results comprises:
    dividing the text to be identified based on a word segmentation algorithm to obtain a division result comprising a plurality of word segments;
    dividing the text to be identified based on an n-gram language model to obtain a division result comprising a plurality of n-gram sequences.
  15. The method of claim 14, wherein the step of dividing the text to be identified based on an n-gram language model to obtain a division result comprising a plurality of n-gram sequences comprises:
    dividing the text to be identified based on a bigram language model to obtain a division result comprising a plurality of bigram sequences;
    dividing the text to be identified based on a trigram language model to obtain a division result comprising a plurality of trigram sequences.
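For illustration, character bigram and trigram division can be obtained with, for example, scikit-learn's CountVectorizer; this is one convenient realization and not mandated by the claim, and the sample text is hypothetical:

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["免费领取大奖"]

# Bigram and trigram character windows; CountVectorizer is used here
# only as a convenient way to enumerate them.
bigram = CountVectorizer(analyzer="char", ngram_range=(2, 2)).fit(text)
trigram = CountVectorizer(analyzer="char", ngram_range=(3, 3)).fit(text)
print(sorted(bigram.vocabulary_))   # ['免费', '取大', '大奖', '费领', '领取']
print(sorted(trigram.vocabulary_))  # ['免费领', '取大奖', '费领取', '领取大']
```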
  16. A device for identifying junk text, comprising:
    a text division unit, adapted to perform text division on a text to be identified to obtain a division result;
    a feature learning unit, adapted to generate a feature vector for the text to be identified based on the division result;
    a first classification unit, adapted to input the feature vector into a plurality of first classification models to obtain outputs of the plurality of first classification models, the first classification models comprising a linear classification model and a deep learning classification model;
    a feature combination unit, adapted to combine at least the outputs of the plurality of first classification models to obtain a combination vector; and
    a second classification unit, adapted to determine, based on the combination vector and using a second classification model, whether the text to be identified is junk text.
  17. A computing device, comprising:
    one or more processors;
    a memory; and
    one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs comprising instructions for performing any one of the methods of claims 1-15.
  18. A readable storage medium storing a program, the program comprising instructions that, when executed by a computing device, cause the computing device to perform any one of the methods of claims 1-15.
PCT/CN2019/105348 2018-09-17 2019-09-11 Junk text identification method and device, computing device and readable storage medium WO2020057413A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811083369.8A CN110929025B (en) 2018-09-17 2018-09-17 Junk text recognition method and device, computing equipment and readable storage medium
CN201811083369.8 2018-09-17

Publications (1)

Publication Number Publication Date
WO2020057413A1 true WO2020057413A1 (en) 2020-03-26

Family

ID=69855841

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/105348 WO2020057413A1 (en) 2018-09-17 2019-09-11 Junk text identification method and device, computing device and readable storage medium

Country Status (2)

Country Link
CN (1) CN110929025B (en)
WO (1) WO2020057413A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111625645B (en) * 2020-05-14 2023-05-23 北京字节跳动网络技术有限公司 Training method and device for text generation model and electronic equipment
CN111711618A (en) * 2020-06-02 2020-09-25 支付宝(杭州)信息技术有限公司 Risk address identification method, device, equipment and storage medium
CN113723096A (en) * 2021-07-23 2021-11-30 智慧芽信息科技(苏州)有限公司 Text recognition method and device, computer-readable storage medium and electronic equipment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102541958A (en) * 2010-12-30 2012-07-04 百度在线网络技术(北京)有限公司 Method, device and computer equipment for identifying short text category information
CN106960040A (en) * 2017-03-27 2017-07-18 北京神州绿盟信息安全科技股份有限公司 A kind of URL classification determines method and device
WO2017190527A1 (en) * 2016-05-06 2017-11-09 华为技术有限公司 Text data classification method and server
CN107734131A (en) * 2016-08-11 2018-02-23 中兴通讯股份有限公司 A kind of short message sorting technique and device
CN107844558A (en) * 2017-10-31 2018-03-27 金蝶软件(中国)有限公司 The determination method and relevant apparatus of a kind of classification information

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887523B (en) * 2010-06-21 2013-04-10 南京邮电大学 Method for detecting image spam email by picture character and local invariant feature
US20150310862A1 (en) * 2014-04-24 2015-10-29 Microsoft Corporation Deep learning for semantic parsing including semantic utterance classification
US10331782B2 (en) * 2014-11-19 2019-06-25 Lexisnexis, A Division Of Reed Elsevier Inc. Systems and methods for automatic identification of potential material facts in documents
CN107515873B (en) * 2016-06-16 2020-10-16 阿里巴巴集团控股有限公司 Junk information identification method and equipment
CN106599933A (en) * 2016-12-26 2017-04-26 哈尔滨工业大学 Text emotion classification method based on the joint deep learning model
CN107507038B (en) * 2017-09-01 2021-03-19 美林数据技术股份有限公司 Electricity charge sensitive user analysis method based on stacking and bagging algorithms
CN107885853A (en) * 2017-11-14 2018-04-06 同济大学 A kind of combined type file classification method based on deep learning
CN107943941B (en) * 2017-11-23 2021-10-15 珠海金山网络游戏科技有限公司 Junk text recognition method and system capable of being updated iteratively

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535944A (en) * 2020-04-21 2021-10-22 阿里巴巴集团控股有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN113590812B (en) * 2020-04-30 2024-03-05 阿里巴巴集团控股有限公司 Junk text training sample screening method and device and electronic equipment
CN113590812A (en) * 2020-04-30 2021-11-02 阿里巴巴集团控股有限公司 Screening method and device of junk text training samples and electronic equipment
CN112131379A (en) * 2020-08-20 2020-12-25 彭涛 Method, device, electronic equipment and storage medium for identifying problem category
CN112560463B (en) * 2020-12-15 2023-08-04 中国平安人寿保险股份有限公司 Text multi-labeling method, device, equipment and storage medium
CN112560463A (en) * 2020-12-15 2021-03-26 中国平安人寿保险股份有限公司 Text multi-labeling method, device, equipment and storage medium
CN113869431A (en) * 2021-09-30 2021-12-31 平安科技(深圳)有限公司 False information detection method, system, computer device and readable storage medium
CN113869431B (en) * 2021-09-30 2024-05-07 平安科技(深圳)有限公司 False information detection method, system, computer equipment and readable storage medium
CN114817526A (en) * 2022-02-21 2022-07-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal
CN114817526B (en) * 2022-02-21 2024-03-29 华院计算技术(上海)股份有限公司 Text classification method and device, storage medium and terminal
CN116564538A (en) * 2023-07-05 2023-08-08 肇庆市高要区人民医院 Hospital information real-time query method and system based on big data
CN116564538B (en) * 2023-07-05 2023-12-19 肇庆市高要区人民医院 Hospital information real-time query method and system based on big data
CN116975863A (en) * 2023-07-10 2023-10-31 福州大学 Malicious code detection method based on convolutional neural network

Also Published As

Publication number Publication date
CN110929025A (en) 2020-03-27
CN110929025B (en) 2023-04-25

Similar Documents

Publication Publication Date Title
WO2020057413A1 (en) Junk text identification method and device, computing device and readable storage medium
US11734329B2 (en) System and method for text categorization and sentiment analysis
US11675977B2 (en) Intelligent system that dynamically improves its knowledge and code-base for natural language understanding
US10949709B2 (en) Method for determining sentence similarity
US10209782B2 (en) Input-based information display method and input system
US20210201143A1 (en) Computing device and method of classifying category of data
CN109670163B (en) Information identification method, information recommendation method, template construction method and computing device
US11727211B2 (en) Systems and methods for colearning custom syntactic expression types for suggesting next best correspondence in a communication environment
WO2017118427A1 (en) Webpage training method and device, and search intention identification method and device
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
WO2018205084A1 (en) Providing local service information in automated chatting
US9158839B2 (en) Systems and methods for training and classifying data
CN108664574A (en) Input method, terminal device and the medium of information
WO2014117553A1 (en) Method and system of adding punctuation and establishing language model
WO2021063089A1 (en) Rule matching method, rule matching apparatus, storage medium and electronic device
WO2022141875A1 (en) User intention recognition method and apparatus, device, and computer-readable storage medium
WO2022257452A1 (en) Meme reply method and apparatus, and device and storage medium
US11676410B1 (en) Latent space encoding of text for named entity recognition
EP3928221A1 (en) System and method for text categorization and sentiment analysis
Yang et al. Enhanced twitter sentiment analysis by using feature selection and combination
CN113051380A (en) Information generation method and device, electronic equipment and storage medium
US11922515B1 (en) Methods and apparatuses for AI digital assistants
CN111555960A (en) Method for generating information
CN107729509B (en) Discourse similarity determination method based on recessive high-dimensional distributed feature representation
US11640233B2 (en) Foreign language machine translation of documents in a variety of formats

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 19861952

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 EP: PCT application non-entry in European phase

Ref document number: 19861952

Country of ref document: EP

Kind code of ref document: A1