CN109670041A - A noisy illegal short-text recognition method based on a dual-channel text convolutional neural network - Google Patents
A noisy illegal short-text recognition method based on a dual-channel text convolutional neural network
- Publication number
- CN109670041A CN109670041A CN201811446969.6A CN201811446969A CN109670041A CN 109670041 A CN109670041 A CN 109670041A CN 201811446969 A CN201811446969 A CN 201811446969A CN 109670041 A CN109670041 A CN 109670041A
- Authority
- CN
- China
- Prior art keywords
- character
- text
- convolutional neural
- neural networks
- binary channels
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/126—Character encoding
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/191—Automatic line break hyphenation
Abstract
The present invention relates to a method for recognizing noisy illegal short texts based on a dual-channel text convolutional neural network. The method comprises the preprocessing of noisy short texts, the construction of a dual-channel text convolutional neural network model, and the training and real-time recognition of the model. The preprocessing of noisy short texts normalizes noise characters, eliminating the influence of noise and improving the learning ability of the convolutional neural network model. The dual-channel text convolutional neural network model is a text convolutional neural network model that simultaneously takes the preprocessed character sequence and pinyin sequence as inputs. Because the pinyin sequence adds an additional input and additional modeling capacity, the model can eliminate the influence of homophone character substitution on classification performance. The present invention can handle the influence of homophone character substitution, substitution of similarly shaped English characters, substitution of numeric characters with identical meaning but different forms, and the like. Experimental results show that the method of the present invention achieves high accuracy and a low false detection rate in recognizing illegal short texts with noise.
Description
Technical field
The invention belongs to the field of computer natural language processing, and relates to a method for recognizing noisy illegal short texts based on a dual-channel text convolutional neural network.
Background art
With the rapid development of the Internet, sharing and exchanging information and opinions through the network has become an important mode of network use. For example, users discuss problems on BBS forums; publish opinions, news, and comments on microblogs; communicate through instant messaging tools; comment on the review pages of news websites; interact through live video streaming services; and post bullet-screen comments on currently popular video platforms. This user-generated-content model facilitates the sharing and exchange of information among users. However, this way of publishing Internet content is also easily exploited by criminals to publish illegal advertising information, among which pornographic advertising is currently the main type of illegal harmful information. Such information is generally published in the form of short texts that direct users to pornographic websites or to QQ and WeChat accounts offering sexual services. To prevent the spread of such illegal information, websites and applications need to build dedicated server-side programs that automatically examine the content submitted by users and determine whether it is illegal. If it is, publication is blocked and the relevant functions of the offending account are disabled.
At present, the most common method of identifying and filtering harmful information is keyword filtering. This approach requires an illegal-keyword list to be built in advance; submitted text content is searched for words in the list, and if any are present the content is judged illegal. This method is efficient but has a high misclassification rate: normal texts that happen to contain a keyword are wrongly flagged. To address this problem, text-classification methods have been proposed. Such methods generally represent the input text as a vector space model, constructing a feature vector from the characters or words that appear, often weighted with TF*IDF to express the importance of each word or character feature. The feature vectors are then classified with statistical machine learning methods; common classifiers include support vector machines, AdaBoost, neural networks, and decision trees. These methods can reduce the false detection rate to some extent, but because a short text contains only a limited number of characters, and the contextual relations between characters are not considered, the recognition accuracy of such methods still falls short of the requirements of practical applications.
In addition, illegal users adapt the form of their published content to the recognition system in order to evade it. The main evasion technique at present is to create variants of the keywords in the short text, e.g. writing '发票' (invoice) as '发飘', or '裸聊' (nude chat) as a homophonic variant. A survey of existing illegal pornographic advertisements reveals the following main variant forms: (1) interspersed special symbols (usually punctuation-like non-text characters), e.g. "QQ296』161『7102"; (2) substitution of characters with similar shapes, e.g. replacing '日' with '曰'; (3) homophone or near-homophone substitution, e.g. replacing '微' with '为'; (4) pinyin substitution, e.g. replacing '微信' (WeChat) with 'wei xin'; (5) reversing the order of a keyword or of the whole sentence; (6) splitting a Chinese character into its radical and other components, e.g. splitting '裸' into '衣果'; (7) traditional-character forms of keywords; (8) interspersing similar-shaped characters into English letters and digits, e.g. "a5m2coM"; (9) converting numeric characters into other numeral forms or Chinese numerals, e.g. "威765510103②".
To cope with the variant forms of keywords, keyword-expansion methods are generally used. Such a method constructs the variant forms of each keyword in the keyword list according to the possible variant patterns and adds them to the list. In addition, to cope with keyword variants, Wen Yuanxu proposed a method of extracting variant features [Research on variant short text filtering algorithms, 2012.12, Beijing University of Posts and Telecommunications, master's thesis]. That method uses a set of rules to construct features expressing the keyword variant forms that may occur, and then performs recognition with a Bayesian classifier. However, variant features extracted with hand-built rules are easily discovered by illegal users, who can then evade the system with new variant-generation methods. Moreover, constructing variant features manually is a relatively difficult job.
Aiming at the problems that current methods have difficulty handling keyword variants effectively and that traditional short-text classification methods have low accuracy, the method of the present invention uses deep learning to discover the variant features that may appear in the samples and to mine the correlations between characters, improving the classification accuracy for noisy short texts. The method exploits the powerful learning ability of deep learning and is easy to update from new samples, so it can quickly cope with newly emerging variant forms. The method of the invention can be applied at the server side of various websites and Internet applications to automatically examine the noisy short-text content submitted by users and prevent the spread of harmful information.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a method for recognizing noisy illegal short texts based on a dual-channel text convolutional neural network. The method mainly comprises three parts: the preprocessing of noisy short texts, the construction of the dual-channel text convolutional neural network model, and the training and real-time recognition of the dual-channel text convolutional neural network model. The preprocessing of noisy short texts reduces the influence of noise characters on the subsequent training and recognition of the convolutional neural network model. The construction of the dual-channel text convolutional neural network model produces a neural network model specifically designed against homophone-substitution noise. The training and real-time recognition of the model use the relevant techniques of deep learning.
(1) The preprocessing of noisy short texts mainly comprises the following processing steps: normalization of numeric characters, normalization of English characters, conversion of traditional Chinese characters to simplified characters, processing of symbols with special meanings, removal of interspersed noise symbols, unified representation of consecutive numeric characters, character-sequence segmentation, and conversion of Chinese characters to pinyin representation. Numeric-character normalization and English-character normalization convert numeric and English characters that are similar in shape or identical in meaning but different in character code into half-width digits and lowercase English letters. Numeric-character normalization builds a numeric-character correspondence table through which the symbols of all the various encodings with numeric meaning under Unicode are converted to standard numeric character codes, e.g. converting '⑦', '〇', and '⑵' to the standard digits '7', '0', and '2'. English-character normalization converts all differently encoded letter characters with similar shapes to standard lowercase English letters, e.g. converting look-alike variants to the standard lowercase letters 'a', 'f', and 'k'; this is implemented by constructing a dictionary-structured lookup table. Conversion of traditional Chinese characters to simplified characters converts any traditional characters present in the character sequence to their simplified forms; this is implemented by constructing a correspondence table. Special-meaning symbol processing converts certain symbols that carry a specific meaning to characters with the corresponding meaning, e.g. converting symbols shaped like '+', such as '╁', to the Chinese character '加' (plus). The symbols with special meanings that may occur are found by analyzing the corpus, and a lookup table is constructed to realize the conversion. Removal of interspersed noise symbols filters out of the short text, after the above character conversions, all symbols that are not Chinese characters, English letters, or digits, e.g. removing the 'ˇ' marks in '抠扣251ˇ764ˇ5947'. Unified representation of consecutive numeric characters expresses each run of consecutive digits in the short text as '<num_n>', where n is the number of consecutive digits, e.g. '抠扣2517645947' is expressed as '抠扣<num_10>'; this representation reduces the influence of numeric-character sparsity in the samples. Character-sequence segmentation splits the short text after the above conversions: each Chinese character is segmented as an individual token, each run of consecutive English letters as one token, and each '<num_n>' placeholder as one token, e.g. 'jia抠扣<num_10>' is segmented as 'jia / 抠 / 扣 / <num_10>'. Conversion of Chinese characters to pinyin represents, on the basis of the segmentation, each Chinese character as its corresponding toneless pinyin, e.g. 'jia抠扣<num_10>' is converted to 'jia kou kou <num_10>'; this conversion is likewise implemented with a dictionary.
The above preprocessing steps can eliminate the influence of most noise characters, reduce the number of samples needed to train the deep learning model, and give the method a certain adaptability, at recognition time, to noise characters that did not appear during training.
(2) Construction of the dual-channel text convolutional neural network model. The preprocessing steps can eliminate the influence of most noise characters, but it is difficult for them to eliminate the influence of homophone substitution of Chinese characters. For this reason, the method creates a text convolutional neural network model that simultaneously takes the preprocessed character sequence and pinyin sequence as inputs, ensuring that the model can eliminate the influence of homophone substitution on classification performance. The structure of the dual-channel text convolutional neural network model is shown in Fig. 3. The network consists of two text convolutional neural networks: the input of one is the character sequence, and the input of the other is the pinyin sequence. This structure captures the information of the characters and of the pinyin simultaneously, so noisy text produced by homophone substitution can be resolved through the pinyin sequence. Each channel first has a word-vector embedding layer that converts characters (words or pinyin) into word-vector representations; the word-vector representation of the sentence is then convolved at the configured convolution scales, each convolution kernel producing several convolution results; a nonlinear activation function is then applied to all convolution results; max pooling is then applied to the results, each filter (single convolution kernel) yielding one value. Finally, the outputs of all filters pass through a fully connected layer and are input to a softmax for classification. In this method, the dual-channel network is divided into a character text convolutional neural network and a pinyin text convolutional neural network; the two networks may be configured with different word-vector lengths, vocabularies, and convolution scales. After max pooling, the feature values obtained by the two text convolutional neural networks are concatenated and input, through the fully connected layer, to the softmax.
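The forward pass described above can be sketched in minimal pure Python (embedding lookup omitted; the already-embedded sequences are passed in directly). This is not the patent's tensorflow implementation; the kernel width, weights, and dimensions below are illustrative assumptions.

```python
import math

def conv_relu_maxpool(emb, kernel, k):
    """One convolution filter of width k over the token axis, then ReLU,
    then max-over-time pooling: one feature value per filter."""
    feats = []
    for i in range(len(emb) - k + 1):
        window = [x for row in emb[i:i + k] for x in row]  # flatten k embeddings
        feats.append(max(0.0, sum(w * x for w, x in zip(kernel, window))))
    return max(feats)

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    return [v / sum(e) for v in e]

def dual_channel_forward(char_emb, pinyin_emb,
                         char_kernels, pinyin_kernels, fc_weights, k=2):
    # Each channel pools one value per filter; the pooled features of the
    # character channel and the pinyin channel are concatenated, and a
    # fully connected layer feeds the softmax classifier.
    feats = [conv_relu_maxpool(char_emb, kr, k) for kr in char_kernels]
    feats += [conv_relu_maxpool(pinyin_emb, kr, k) for kr in pinyin_kernels]
    logits = [sum(w * f for w, f in zip(row, feats)) for row in fc_weights]
    return softmax(logits)
```

With one filter per channel and one-dimensional embeddings, a call returns a two-class probability distribution over normal/illegal.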
(3) Training and real-time recognition of the dual-channel text convolutional neural network model. To train the network, a sample database must be constructed. The samples are divided into positive and negative classes, representing illegal short texts and normal short texts respectively. Each sample is converted by preprocessing into a character sequence, and the character sequence is additionally converted into a pinyin sequence (numeric characters and English characters are not converted). During training, the character sequence and the corresponding pinyin sequence are input separately to the corresponding input items of the dual-channel text convolutional neural network model, together with the sample label: 0 denotes normal and 1 denotes illegal. The loss function set for training the model is:
loss = tf.reduce_mean(loss1) + lambda * l2_loss
where l2_loss is a parameter regularization term added to prevent overfitting; this term acts on the weights of the fully connected layer before the softmax. loss1 is the cross-entropy loss: the output of the fully connected layer (one value per class) is first passed through the softmax function to convert it into the probability of belonging to each class, and the cross entropy of the softmax output with the true sample label (class) is then computed. The tf.reduce_mean function computes the average cross entropy over a batch in loss1. The loss function of this method therefore comprises the cross-entropy term and the weight regularization term, with lambda weighting the two. After setting the loss function, the method sets the optimization method and computes the optimal parameters by gradient descent; this method uses the Adam optimizer. During training, all samples are preprocessed and converted into character sequences and pinyin sequences; vocabularies of characters and of pinyin are then constructed from the training samples; the characters and pinyin in the training samples are converted to numeric ids according to the respective vocabularies (essentially the indices into the vocabularies); and then, together with the sample labels, they are input separately to the character-sequence channel and the pinyin-sequence channel of the dual-channel text convolutional neural network model. Once the training data are ready, the data are iterated batch by batch and the parameters are updated by the gradient of the loss function, realizing the training of the model.
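The loss above can be written out in plain Python as a sketch of what the tensorflow expression computes (not the patent's code; `lam` stands for lambda, and note that tf.nn.l2_loss halves the sum of squares, a constant factor that can be absorbed into lam):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def total_loss(batch_logits, batch_labels, fc_weights, lam):
    """Mean cross entropy over the batch (tf.reduce_mean(loss1))
    plus lam * l2 penalty on the fully connected weights."""
    ce = 0.0
    for logits, label in zip(batch_logits, batch_labels):
        probs = softmax(logits)
        ce -= math.log(probs[label])        # cross entropy with the true label
    ce /= len(batch_labels)
    l2 = sum(w * w for w in fc_weights)     # L2 regularization term
    return ce + lam * l2
```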
After the parameters of the dual-channel text convolutional neural network model have been trained, the model can be used for real-time recognition of noisy short texts. First, the model and its parameters are imported, together with the vocabularies constructed during training; then the noisy short text to be tested is preprocessed to obtain the character sequence and the pinyin sequence; the two sequences are then input separately into the character-sequence channel and the pinyin-sequence channel of the dual-channel text convolutional neural network model. Finally, the model computes the softmax values, from which the judged class is obtained.
Compared with the prior art, the present invention has the following advantages:
The present invention can be used for the recognition of noisy illegal short texts, improving the accuracy and robustness of text classification methods in recognizing illegal short texts with various kinds of artificially added noise. In particular, the method can effectively recognize illegal short texts with the various homophone substitutions that cannot currently be handled effectively. Experiments prove the validity of the dual-channel text convolutional neural network.
Description of the drawings
Fig. 1 shows the flow chart of the invention;
Fig. 2 shows the preprocessing flow chart of the method of the invention;
Fig. 3 shows the dual-channel text convolutional neural network model of the method of the invention;
Fig. 4 shows the change in accuracy during training of the dual-channel text convolutional neural network; the dark line in the figure is the result of smoothing the light line;
Fig. 5 shows the change in loss during training of the dual-channel text convolutional neural network; the dark line in the figure is the result of smoothing the light line.
Specific embodiment
The present invention is described in detail below in conjunction with the drawings. It should be noted that the described embodiments are only intended to facilitate understanding of the invention and do not limit it in any way.
The method of the present invention is not limited to handling pornographic text information; other similar illegal advertising information, such as the various invoice-selling advertisement texts, can be handled equally effectively, requiring only that sample information of the relevant class be collected so that a corresponding recognizer can be obtained by learning. In the present embodiment, the main objects to be processed are pornographic promotion texts, i.e. the various pornographic advertisement texts published by illegal users on network platforms; most of these texts have noise added in order to defeat existing illegal-text detection systems. The present embodiment is implemented with the tensorflow deep learning framework, but the construction and training of the model can equally use other deep learning frameworks.
It is further described the embodiment of the present invention below with reference to the accompanying drawings.
Fig. 1 is a flow diagram illustrating the relationships and flow among the parts of the invention. A method for recognizing noisy illegal short texts based on a dual-channel text convolutional neural network specifically comprises the following parts: preprocessing of noisy short texts, construction of the dual-channel text convolutional neural network model, training of the dual-channel text convolutional neural network model, and recognition based on the dual-channel text convolutional neural network model. The preprocessing part unifies the various characters in the noisy short text into standard characters, removes noise, and performs segmentation. The model construction part mainly creates a text convolutional neural network model adapted to the recognition of noisy text. The model training part obtains the parameters of the network model from the constructed network model and the defined loss function through sample data and training parameter settings. The recognition part classifies the input noisy short text in real time according to the trained network model parameters. The realization of the whole method comprises two basic processes: the model training process and the real-time recognition process. In the model training process, the dual-channel text convolutional neural network model is first designed; the training samples are then preprocessed and the training data are prepared; the parameters of the convolutional neural network model are then trained with the set training parameters, and the model is saved after the training requirements are met. In the real-time recognition process, the input noisy short text is preprocessed and then input into the dual-channel text convolutional neural network model for real-time recognition to obtain the recognition result. The realization of each part is described in detail below.
(1) Preprocessing
The preprocessing step processes the input noisy short text through the processing steps listed below; the detailed flow is shown in Fig. 2. The result of the processing can be used directly for training or real-time recognition of the dual-channel text convolutional neural network model. The goal of preprocessing is to reduce the influence of noise. Although noise symbols could also be included in the vocabulary used to train the text convolutional neural network, the diversity of noise-symbol insertion and the sparsity of the training data mean that removing these noise symbols directly through preprocessing copes better with noise added by the various symbol transformations. The preprocessing steps of the method are: normalization of numeric characters, normalization of English characters, conversion of traditional Chinese characters to simplified characters, processing of symbols with special meanings, removal of interspersed noise symbols, unified representation of consecutive numeric characters, character-sequence segmentation, and conversion of Chinese characters to pinyin representation.
Numeric-character normalization converts all coded symbols with numeric meaning under Unicode into standard numeric character codes, e.g. converting '⑦', '〇', and '⑵' to the standard digits '7', '0', and '2'. To realize this conversion, the method examines all symbols with numeric meaning under Unicode and constructs code-conversion correspondences according to the ordering of the numeric characters. For example, the circled digits '①' through '⑨' are arranged consecutively in the Unicode code table, so for '①' through '⑨' the method realizes the conversion with the formula: ch_out = chr(ord(ch) - ord("①") + ord("1")), where the ord function obtains the Unicode code of a character and the chr function obtains the character for a Unicode code. Similarly, for '⑩' through '⑲': ch_out = '1' + chr(ord(ch) - ord("⑩") + ord("0")); '⑳' is treated specially: if ch is '⑳', the output is directly assigned '20'. In the same way, the numeric-character normalization of this method handles the other contiguous runs, such as the full-width digits '０' ≤ ch ≤ '９', the parenthesized digits '⑴' ≤ ch ≤ '⑽', and the Roman numerals 'Ⅰ' ≤ ch ≤ 'Ⅻ'. For the digits with a full stop, '⒈' ≤ ch ≤ '⒛', a dot is appended after the converted result; for the ideographic day signs, the character '日' is appended after the converted result. For characters appearing in numeral strings such as the full-width digits '０１２３４５６７８９' and the Chinese capital numerals '零壹贰叁肆伍陆柒捌玖', the position of the input character in the string is obtained with the index function, and the integer position is then converted to a character with the str function. For symbols without numeric meaning, numeric-character normalization performs no processing and returns the original character.
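Because these runs are contiguous in Unicode, the chr/ord arithmetic above can be sketched directly. Only a few of the runs named above are shown; the remaining runs follow the same pattern.

```python
def normalize_digit(ch: str) -> str:
    """Map one decorated digit character to a plain ASCII digit string.

    Characters outside the handled runs are returned unchanged,
    as the method specifies for symbols without numeric meaning.
    """
    o = ord(ch)
    # circled digits 1-9 (U+2460 CIRCLED DIGIT ONE onward) -> '1'..'9'
    if ord("①") <= o <= ord("⑨"):
        return chr(o - ord("①") + ord("1"))
    # circled numbers 10-19 -> '10'..'19'
    if ord("⑩") <= o <= ord("⑲"):
        return "1" + chr(o - ord("⑩") + ord("0"))
    if ch == "⑳":                     # 20 must be special-cased
        return "20"
    # full-width digits '０'..'９' -> ASCII via position arithmetic
    if ord("０") <= o <= ord("９"):
        return str(o - ord("０"))
    if ch == "〇":                     # ideographic zero
        return "0"
    return ch
```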
English-character normalization converts all differently encoded letter characters with similar shapes to standard lowercase English letters, e.g. converting look-alike variants to the standard lowercase letters 'a', 'f', and 'k'. Because similar letter characters are not encoded contiguously in the Unicode code table, the method is implemented by constructing a dictionary-structured lookup table. To this end, the method stores the shape-similar letter characters in a file, with the Unicode look-alikes of each letter saved as one line: the line begins with the standard letter, followed by ':', and the shape-similar letters then follow, separated by spaces. The method constructs the shape-similar-character correspondences for all 26 English letters. From this file, the method creates a dictionary whose keys are the look-alike characters and whose values are the standard English letters; character conversion is realized by looking up the dictionary. Uppercase letters are converted with the ordinary uppercase-to-lowercase function. For non-English letters, English-letter normalization outputs the character unchanged.
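The table construction described above can be sketched as follows; the file contents are replaced by an in-memory list, and the look-alike characters chosen here (Cyrillic 'а', 'к', etc.) are illustrative stand-ins for the patent's actual table.

```python
# Hypothetical excerpt in the line format described above:
# standard letter, ':', then shape-similar characters separated by spaces.
HOMOGLYPH_LINES = [
    "a: а ɑ @",   # Cyrillic a, Latin alpha, at sign (illustrative picks)
    "k: к",       # Cyrillic k (illustrative pick)
]

def build_homoglyph_map(lines):
    # invert each line into {look-alike -> standard letter}
    table = {}
    for line in lines:
        std, _, rest = line.partition(":")
        for variant in rest.split():
            table[variant] = std.strip()
    return table

def normalize_letter(ch, table):
    if ch in table:
        return table[ch]
    if "A" <= ch <= "Z":              # uppercase handled by plain lowercasing
        return ch.lower()
    return ch                         # non-English letters pass through
```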
Conversion of traditional Chinese characters to simplified characters converts any traditional characters present in the character sequence to their simplified forms. The method is implemented by constructing a correspondence table, which is built by finding all Chinese characters in Unicode whose simplified and traditional forms differ; a dictionary structure is likewise used to realize fast conversion. The implementation uses the zhtools python package, which provides the simplified-traditional correspondences.
Special-meaning symbol processing converts certain symbols that carry a specific meaning to characters with the corresponding meaning, e.g. converting symbols shaped like '+', such as '╁', to the Chinese character '加' (plus). The characters with such special meanings carry specific and important semantic information, and many noise-insertion schemes work precisely by replacing a meaningful Chinese character with such a shape-similar symbol. For this purpose, the method proceeds as for shape-similar English letters: a lookup table is constructed, e.g. 加: ＋ ╂ ╃ ╁ ╀ ┿ ┾ ┽ ┼ ╋, and the conversion is realized with a dictionary structure. The symbols with special meanings that may occur are found by analyzing the corpus, and the lookup table realizing the conversion is built from them.
Removal of interspersed noise symbols filters out of the short text, after the above character conversions, all symbols that are not Chinese characters, English letters, or digits, e.g. removing the 'ˇ' marks in '抠扣251ˇ764ˇ5947'. After the characters have been normalized, this operation is easy to implement: all non-Chinese, non-English, non-digit symbols are simply removed.
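This filtering amounts to one regular-expression substitution. A sketch, assuming the CJK Unified Ideographs block (U+4E00 to U+9FFF) is what counts as "Chinese characters":

```python
import re

# drop everything that is not a CJK ideograph, an ASCII letter, or a digit
NOISE = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]")

def strip_noise(text: str) -> str:
    return NOISE.sub("", text)
```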
Unified representation of continuous numeric characters means replacing each run of consecutive digits in the short text processed above with the form "&lt;num_n&gt;", where n is the number of consecutive numeric characters. For example, "抠抠2517645947" is represented as "抠抠&lt;num_10&gt;". The purpose of this representation is to eliminate the diversity of numeric values and the resulting sparsity of the training data. For instance, a QQ number is usually longer than 7 digits, whereas numbers expressing quantities are usually shorter than 7 digits, so the length marker both simplifies feature extraction and training for the subsequent deep learning model and lets it cope with QQ numbers that never appeared in the training samples.
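The digit-run replacement described above is a one-line substitution; this sketch uses a callable replacement to insert the run length.

```python
import re

def mark_digit_runs(text: str) -> str:
    """Replace each maximal run of digits with <num_n>, where n is the run length."""
    return re.sub(r"\d+", lambda m: f"<num_{len(m.group())}>", text)
```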
Character sequence segmentation splits the short text after the character conversions above. In this method, each Chinese character is cut out as an individual token, each run of consecutive English characters is kept as one unit, and a continuous-digit placeholder "&lt;num_n&gt;" is kept as one unit. For example, "jia抠抠&lt;num_10&gt;" is split into "jia / 抠 / 抠 / &lt;num_10&gt;".
Chinese-character-to-pinyin conversion represents each Chinese character, on the basis of the segmentation, as its corresponding toneless pinyin. For example, "jia抠抠&lt;num_10&gt;" is converted to "jia kou kou &lt;num_10&gt;". This character-to-pinyin conversion is likewise realized through a dictionary. The implementation uses the Python package pinyin: calling its get method on a Chinese character returns that character's pinyin.
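The last two preprocessing steps, segmentation and pinyin conversion, can be sketched together. This is an illustration only: the tokenizing regular expression and the toy pinyin table are assumptions; the patent itself uses the `pinyin` package's get method for the character-to-pinyin lookup.

```python
import re

# Tokenize: <num_n> placeholders and runs of ASCII letters stay whole,
# each Chinese character (approximated as U+4E00-U+9FFF) becomes its own token.
TOKEN = re.compile(r"<num_\d+>|[A-Za-z]+|[\u4e00-\u9fff]")

def segment(text: str):
    """Split a preprocessed short text into tokens."""
    return TOKEN.findall(text)

# Toy character-to-pinyin table for illustration; the patent calls
# the `pinyin` package instead of using a hand-written dict like this.
PINYIN = {"加": "jia", "抠": "kou"}

def to_pinyin(tokens):
    """Map Chinese tokens to toneless pinyin; other tokens pass through."""
    return [PINYIN.get(t, t) for t in tokens]
```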
(2) Construction of the dual-channel text convolutional neural network model
The preprocessing steps eliminate the influence of most noise characters, but one difficulty they cannot remove is homophone substitution of Chinese characters. Because Chinese has many homophonous characters, a text convolutional neural network trained only on character sequences can hardly capture information about homophones that never appear in the training samples, so homophone substitutes absent from the samples cannot be recognized effectively. For this reason, the method constructs a dual-channel text convolutional neural network model that takes the preprocessed character sequence and pinyin sequence as inputs simultaneously, ensuring that the model can cope with homophone character substitution. The structure of the dual-channel model is shown in Figure 3. It is composed of two text convolutional neural networks: one takes the character sequence as input, the other the pinyin sequence. A single text convolutional neural network consists of: a word-vector embedding layer that converts each character into a vector; a convolutional layer, in which a kernel applied at a given scale to a sequence of length len(sequence) yields len(sequence) - filter_size + 1 convolution results; a ReLU activation applied to all convolution results for nonlinear processing; and max pooling over the results, so that each filter yields a single value. Finally, the pooled values of all filters are fed through a fully connected layer into softmax for classification. In this method, the two channels are called the character text convolutional neural network and the pinyin text convolutional neural network. The two networks may use different word-vector lengths, different vocabularies, and different convolution scales. After max pooling, the feature values produced by the two networks are concatenated and passed through the fully connected layer into softmax. In this embodiment, the word-vector length of the embedding layer is set to 128 in both the character model and the pinyin model. The convolution scales of both channels are set to (3, 4, 5), i.e. kernel sizes of 3, 4, and 5, so that relationships spanning 3, 4, or 5 characters can be captured. In addition, 128 filters are configured for each convolution scale, giving 3*128 filters per channel and 2*3*128 filters over the two channels. The nonlinear function is ReLU. The pooling layer uses max pooling: the maximum of the result vector returned by each filter is taken as the pooling output. For this embodiment, this yields 2*3*128 feature values.
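The per-filter computation described above, len(sequence) - filter_size + 1 convolution results followed by ReLU and max-over-time pooling, can be illustrated with toy dimensions. This is a sketch only: the real model uses 128-dimensional embeddings and 128 filters per scale, and runs in TensorFlow rather than plain Python.

```python
def conv1d_relu_maxpool(embeddings, kernel):
    """embeddings: list of L d-dim vectors; kernel: list of f d-dim weight vectors.
    Returns the max-pooled ReLU response of one filter, as in TextCNN."""
    L, f = len(embeddings), len(kernel)
    responses = []
    for i in range(L - f + 1):                      # L - f + 1 sliding windows
        s = sum(e * w
                for evec, wvec in zip(embeddings[i:i + f], kernel)
                for e, w in zip(evec, wvec))        # dot product over the window
        responses.append(max(0.0, s))               # ReLU nonlinearity
    return max(responses)                           # max-over-time pooling

# Toy example: L=4 tokens, d=2 embedding, filter size f=3 -> 4-3+1 = 2 windows.
emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
ker = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
```

In the full model this scalar is computed for every one of the 2*3*128 filters, and the resulting feature values are concatenated before the fully connected layer.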
(3) Training of the dual-channel text convolutional neural network model
After the deep learning network model is built, the method must set a loss function, so that the model parameters optimal on the training sample set can be obtained by optimization. The loss function set when training this text convolutional neural network model is:
Loss=tf.reduce_mean (loss1)+lambda*l2_loss
Here l2_loss is a parameter regularization term added to prevent parameter overfitting; it acts on the weights of the fully connected layer before softmax. loss1 is the cross-entropy loss, realized in this embodiment with the TensorFlow function softmax_cross_entropy_with_logits: it first applies the softmax function to the output of the fully connected layer (one value per class), converting the output into the probability of belonging to each class, and then computes the cross entropy between the softmax output and the true sample labels (classes). The tf.reduce_mean function averages the cross entropy in loss1 over a batch. The loss function of this method therefore consists of the cross-entropy term and the weight regularization term, with lambda as the weight between the two.
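The loss can be written out in plain Python to make the two terms concrete. This is a sketch that mirrors tf.reduce_mean(loss1) + lambda*l2_loss; the hand-rolled softmax and cross entropy below stand in for softmax_cross_entropy_with_logits.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one logit vector."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]

def loss_fn(batch_logits, batch_onehot, fc_weights, lam):
    """Mean softmax cross-entropy over the batch plus lam * L2 penalty on the
    fully connected weights before softmax."""
    ce = 0.0
    for logits, onehot in zip(batch_logits, batch_onehot):
        probs = softmax(logits)
        ce += -sum(t * math.log(p) for t, p in zip(onehot, probs))
    l2 = sum(w * w for row in fc_weights for w in row)   # l2_loss term
    return ce / len(batch_logits) + lam * l2
```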
After setting the loss function, the method chooses an optimization algorithm, which computes the optimal parameters by gradient descent; this method uses the Adam optimizer. The Adam algorithm dynamically adjusts a separate learning rate for each parameter according to first- and second-moment estimates of the gradient of the loss function with respect to that parameter. The step size of Adam is bounded, so a very large gradient on some sample cannot cause a very large update, and the parameters change more stably.
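A single Adam update for one scalar parameter, following the moment-estimate description above, can be sketched as follows (standard Adam defaults; an illustration, not the patent's TensorFlow call).

```python
import math

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter (t is the 1-based step count).
    Bias-corrected first/second moment estimates bound the effective step size."""
    m = b1 * m + (1 - b1) * grad          # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad * grad   # second moment: running uncentered variance
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    param -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v
```

Because the update divides by sqrt(v_hat), the magnitude of a single step is roughly capped by the learning rate regardless of how large an individual gradient is, which is the stability property the text describes.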
Once the loss function and optimization method are determined, training samples must be prepared for parameter training. The training samples of this embodiment come from a live video streaming platform, on which most pornographic advertisement messages are sent by advertising bots. Based on users' posting behavior patterns, the platform collected 130,000 pornographic advertisement short texts and 80,000 normal short texts, 210,000 samples in total, of which 90% are used as training samples and the remaining 10% as test samples. During training, all samples are first converted by the preprocessing into character sequences and pinyin sequences. Then the character vocabulary and the pinyin vocabulary are built separately from the training samples, and the characters in the training samples are converted to integer ids according to the respective vocabularies (an id is essentially the character's index in the vocabulary). In this embodiment this is realized with tensorflow.contrib.learn.preprocessing.VocabularyProcessor, and the created vocabularies are saved.
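Vocabulary construction and id conversion can be sketched as follows. The patent uses VocabularyProcessor for this, so the padding and unknown-token conventions below are assumptions made for the sketch.

```python
def build_vocab(token_sequences, pad="<pad>", unk="<unk>"):
    """Map each distinct token to an integer id (its index in the vocabulary).
    Reserved ids 0/1 for padding and unknown tokens are an assumption here."""
    vocab = {pad: 0, unk: 1}
    for seq in token_sequences:
        for tok in seq:
            vocab.setdefault(tok, len(vocab))
    return vocab

def to_ids(seq, vocab, length):
    """Convert tokens to ids, truncating or padding to a fixed length."""
    ids = [vocab.get(t, vocab["<unk>"]) for t in seq[:length]]
    return ids + [vocab["<pad>"]] * (length - len(ids))
```

The character channel and the pinyin channel each build their own vocabulary this way, which is why the two embeddings may have different vocabulary sizes.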
Training parameters must be set before training of the deep learning model starts. In this embodiment, batch_size is set to 32, i.e. 32 samples are input at a time to compute an average gradient and update the model parameters. Num_epoches is set to 3, i.e. all training samples are passed over three times, with the training samples randomly shuffled before each pass; experiments showed that the model has fully converged with num_epoches set to 3. Checkpoint_every is set to 1000, i.e. a model checkpoint is saved after every 1000 trained batches. Num_checkpoints is set to 3, i.e. at most three models are kept during training.
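The batching and shuffling regime described above (mini-batches of 32, reshuffled before every epoch) can be sketched as a generator:

```python
import random

def batches(samples, batch_size=32, num_epochs=3, seed=0):
    """Yield shuffled mini-batches; the sample order is re-shuffled every epoch.
    The fixed seed is for reproducibility of this sketch only."""
    rng = random.Random(seed)
    data = list(samples)
    for _ in range(num_epochs):
        rng.shuffle(data)                         # shuffle before each epoch
        for i in range(0, len(data), batch_size):
            yield data[i:i + batch_size]
```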
With the training parameters set and the training samples prepared, training can begin. The evolution of the model's accuracy during training in this embodiment is shown in Figure 4, whose horizontal axis is the number of trained batches. The figure shows that after training for 10,000 batches the model's accuracy reaches an average of 0.94. The evolution of the model's loss during training is shown in Figure 5.
(4) Recognition based on the dual-channel text convolutional neural network model
With the parameters of the dual-channel text convolutional neural network model trained, the method can recognize noisy short texts in real time through the model. First, the model and its parameters must be imported, together with the vocabularies built during training. In this embodiment, the model and parameters are imported using the restore method of TensorFlow's tf.train.import_meta_graph class, and the vocabularies using TensorFlow's VocabularyProcessor.restore function. Then the noisy short text to be tested is preprocessed into a character sequence and a pinyin sequence, and the two sequences are fed into the corresponding character-sequence channel and pinyin-sequence channel of the dual-channel text convolutional neural network model. Finally the model computes the softmax values, from which the predicted class is obtained.
The method was evaluated in this embodiment. On the original test sample set, with the test samples preprocessed (noise removed), an existing single-channel text convolutional network model achieved an accuracy of 0.97, while the dual-channel convolutional neural network model of this method achieved 0.98. The key improvement is that the false detection rate, i.e. the rate at which normal short texts are identified as illegal short texts, fell from 0.019 to 0.016. Since in real scenarios the great majority of samples are normal and illegal short texts are a minority, reducing the false detection rate reduces the number of false alarms on illegal short texts and enhances the practicality of the method.
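The two evaluation figures reported above can be computed as follows. This is a sketch: the label convention 0 = normal, 1 = illegal follows the claims, and the choice of all normal samples as the denominator of the false detection rate is an assumption.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def false_detection_rate(y_true, y_pred):
    """Fraction of normal samples (label 0) wrongly flagged as illegal (label 1)."""
    normal = [(t, p) for t, p in zip(y_true, y_pred) if t == 0]
    return sum(1 for _, p in normal if p == 1) / len(normal)
```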
Claims (7)
- 1. A noisy illegal short text recognition method based on a dual-channel text convolutional neural network, characterized by comprising the following steps:1) preprocessing of the noisy short text; this step includes numeric character normalization, English character normalization, traditional-to-simplified Chinese conversion, special-meaning symbol processing, removal of interspersed noise symbols, unified representation of continuous numeric characters, character sequence segmentation, and Chinese-character-to-pinyin conversion;2) construction of a dual-channel text convolutional neural network model; this step specifically creates a text convolutional neural network model that takes the preprocessed character sequence and pinyin sequence as inputs simultaneously, in order to eliminate the effect of homophone character substitution on classification performance;3) training of the dual-channel text convolutional neural network model and real-time recognition; the training process optimizes the parameters through samples, and the real-time recognition process inputs a short text into the model for classification.
- 2. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the numeric character normalization converts numeric characters that are similar in shape or identical in meaning but different in character encoding into half-width digits; and the English character normalization converts English characters that are similar in shape or identical in meaning but different in character encoding into lowercase English characters.
- 3. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the removal of interspersed noise symbols filters out, from the short text processed in the preceding steps, all symbols that are not Chinese characters, English characters, or numeric characters.
- 4. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the unified representation of continuous numeric characters represents each run of consecutive numeric characters in the short text as the form "&lt;num_n&gt;", where n is the number of consecutive numeric characters, in order to eliminate the sparsity problem of digit strings.
- 5. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the dual-channel text convolutional neural network model in step 2) comprises a character text convolutional neural network and a pinyin text convolutional neural network; one of them takes the character sequence as input, the other the pinyin sequence;step 2) specifically comprises: building a word-vector embedding layer for converting characters or pinyin into word vectors; then convolving the word-vector representation of the sentence according to the convolution scales, each convolution kernel producing several convolution results; then applying a nonlinear activation function to all convolution results for nonlinear processing; finally, after max pooling, concatenating the feature values obtained by the two text convolutional neural networks and feeding them through a fully connected layer into softmax for classification;wherein the two text convolutional neural networks may use different word-vector lengths, different vocabularies, and different convolution scales.
- 6. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the training process in step 3) is specifically:a sample database is constructed, in which the samples are divided into positive and negative classes representing illegal short texts and normal short texts respectively; each sample is converted by the preprocessing into a character sequence, and the character sequence is also converted into a pinyin sequence; during training, the character sequence and the corresponding pinyin sequence are input into the corresponding input channels of the dual-channel text convolutional neural network model, together with the corresponding sample label, where 0 denotes normal and 1 denotes illegal;the loss function set when training the dual-channel text convolutional neural network model is:Loss=tf.reduce_mean (loss1)+lambda*l2_loss, where l2_loss is a parameter regularization term added to prevent parameter overfitting, and loss1 is the cross-entropy loss: the softmax function is first applied to the output of the fully connected layer, converting the output into the probability of belonging to each class, and the cross entropy is then computed between the softmax output and the true sample labels; the tf.reduce_mean function averages the cross entropy in loss1 over a batch; lambda is the weight;during training, all samples are converted by the preprocessing into character sequences and pinyin sequences; the character and pinyin vocabularies are then built separately from the training samples; the characters in the training samples are converted into integer ids according to the respective vocabularies; the sequences are then input, together with the sample labels, into the character-sequence channel and the pinyin-sequence channel of the dual-channel text convolutional neural network model; the data are iterated batch by batch, and after each batch of training data the parameters are updated through the gradient of the loss function, until the iteration stopping condition is reached.
- 7. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the recognition process in step 3) is specifically:first, the trained model and model parameters are imported, together with the vocabularies built during training; the model and parameters are imported using the restore method of TensorFlow's tf.train.import_meta_graph class, and the vocabularies using the VocabularyProcessor.restore function in TensorFlow;then, the noisy short text to be tested is preprocessed into a character sequence and a pinyin sequence, and the two sequences are input into the corresponding character-sequence channel and pinyin-sequence channel of the dual-channel text convolutional neural network model; finally the model computes the softmax values, from which the predicted class is obtained.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811446969.6A CN109670041A (en) | 2018-11-29 | 2018-11-29 | A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109670041A true CN109670041A (en) | 2019-04-23 |
Family
ID=66144697
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811446969.6A Pending CN109670041A (en) | 2018-11-29 | 2018-11-29 | A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109670041A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101876968A (en) * | 2010-05-06 | 2010-11-03 | 复旦大学 | Method for carrying out harmful content recognition on network text and short message service |
US9053431B1 (en) * | 2010-10-26 | 2015-06-09 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
CN106844349A (en) * | 2017-02-14 | 2017-06-13 | 广西师范大学 | Comment spam recognition methods based on coorinated training |
CN108062303A (en) * | 2017-12-06 | 2018-05-22 | 北京奇虎科技有限公司 | The recognition methods of refuse messages and device |
Non-Patent Citations (4)
Title |
---|
YU BENGONG et al., "Research on Chinese short text classification based on CP-CNN", Application Research of Computers * |
LI SHAOQING et al., "Lexical string similarity computation for recognizing keyword variants in harmful text", Computer Applications and Software * |
BI YINLONG, "Research on automatic recognition of Chinese spam short texts", China Masters' Theses Full-text Database, Information Science and Technology * |
CHEN PENGZHAN et al., "Machine Recognition of Individual Behavior and Collaborative Decision-Making", 31 July 2018 * |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110188743A (en) * | 2019-05-13 | 2019-08-30 | 武汉大学 | A kind of taxi invoice identifying system and method |
CN110222173A (en) * | 2019-05-16 | 2019-09-10 | 吉林大学 | Short text sensibility classification method and device neural network based |
CN110222173B (en) * | 2019-05-16 | 2022-11-04 | 吉林大学 | Short text emotion classification method and device based on neural network |
CN110175645A (en) * | 2019-05-27 | 2019-08-27 | 广西电网有限责任公司 | A kind of method and computing device of determining protective device model |
JP2021015549A (en) * | 2019-07-16 | 2021-02-12 | 株式会社マクロミル | Information processing method and information processing device |
CN110705218A (en) * | 2019-10-11 | 2020-01-17 | 浙江百应科技有限公司 | Outbound state identification mode based on deep learning |
CN110751216A (en) * | 2019-10-21 | 2020-02-04 | 南京大学 | Judgment document industry classification method based on improved convolutional neural network |
CN110795536A (en) * | 2019-10-29 | 2020-02-14 | 秒针信息技术有限公司 | Short text matching method and device, electronic equipment and storage medium |
WO2021232725A1 (en) * | 2020-05-22 | 2021-11-25 | 百度在线网络技术(北京)有限公司 | Voice interaction-based information verification method and apparatus, and device and computer storage medium |
JP7266683B2 (en) | 2020-05-22 | 2023-04-28 | バイドゥ オンライン ネットワーク テクノロジー(ペキン) カンパニー リミテッド | Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction |
JP2022537000A (en) * | 2020-05-22 | 2022-08-23 | バイドゥ オンライン ネットワーク テクノロジー(ペキン) カンパニー リミテッド | Information verification method, apparatus, device, computer storage medium, and computer program based on voice interaction |
CN111651974B (en) * | 2020-06-23 | 2022-11-01 | 北京理工大学 | Implicit discourse relation analysis method and system |
CN111651974A (en) * | 2020-06-23 | 2020-09-11 | 北京理工大学 | Implicit discourse relation analysis method and system |
CN111783998A (en) * | 2020-06-30 | 2020-10-16 | 百度在线网络技术(北京)有限公司 | Illegal account recognition model training method and device and electronic equipment |
CN111783998B (en) * | 2020-06-30 | 2023-08-11 | 百度在线网络技术(北京)有限公司 | Training method and device for illegal account identification model and electronic equipment |
CN112989789A (en) * | 2021-03-15 | 2021-06-18 | 京东数科海益信息科技有限公司 | Test method and device of text audit model, computer equipment and storage medium |
CN113449510A (en) * | 2021-06-28 | 2021-09-28 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and storage medium |
CN113449510B (en) * | 2021-06-28 | 2022-12-27 | 平安科技(深圳)有限公司 | Text recognition method, device, equipment and storage medium |
CN113591440A (en) * | 2021-07-29 | 2021-11-02 | 百度在线网络技术(北京)有限公司 | Text processing method and device and electronic equipment |
CN114529259A (en) * | 2022-02-18 | 2022-05-24 | 苏州浪潮智能科技有限公司 | Conference summary auditing method, device, equipment and storage medium |
CN116226357A (en) * | 2023-05-09 | 2023-06-06 | 武汉纺织大学 | Document retrieval method under input containing error information |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109670041A (en) | A kind of band based on binary channels text convolutional neural networks is made an uproar illegal short text recognition methods | |
CN108984530B (en) | Detection method and detection system for network sensitive content | |
CN110807328B (en) | Named entity identification method and system for legal document multi-strategy fusion | |
CN113254599B (en) | Multi-label microblog text classification method based on semi-supervised learning | |
CN109446404B (en) | Method and device for analyzing emotion polarity of network public sentiment | |
CN110134946B (en) | Machine reading understanding method for complex data | |
CN109977416A (en) | A kind of multi-level natural language anti-spam text method and system | |
CN109684642B (en) | Abstract extraction method combining page parsing rule and NLP text vectorization | |
CN110287323B (en) | Target-oriented emotion classification method | |
CN110532554A (en) | A kind of Chinese abstraction generating method, system and storage medium | |
CN109635288A (en) | A kind of resume abstracting method based on deep neural network | |
CN110415071B (en) | Automobile competitive product comparison method based on viewpoint mining analysis | |
CN109002473A (en) | A kind of sentiment analysis method based on term vector and part of speech | |
CN107688630B (en) | Semantic-based weakly supervised microbo multi-emotion dictionary expansion method | |
CN108563638A (en) | A kind of microblog emotional analysis method based on topic identification and integrated study | |
CN112434164B (en) | Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration | |
CN105975497A (en) | Automatic microblog topic recommendation method and device | |
CN111444704B (en) | Network safety keyword extraction method based on deep neural network | |
CN110826298A (en) | Statement coding method used in intelligent auxiliary password-fixing system | |
CN107992468A (en) | A kind of mixing language material name entity recognition method based on LSTM | |
CN107894976A (en) | A kind of mixing language material segmenting method based on Bi LSTM | |
CN113220964B (en) | Viewpoint mining method based on short text in network message field | |
CN116932736A (en) | Patent recommendation method based on combination of user requirements and inverted list | |
CN111178009B (en) | Text multilingual recognition method based on feature word weighting | |
CN111460147A (en) | Title short text classification method based on semantic enhancement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
CB03 | Change of inventor or designer information | | Inventor after: Zhou Jianzheng; Yao Jinliang; Ming Jianhua; Huang Jinhai. Inventor before: Zhou Jianzheng; Yao Jinliang; Huang Jinhai; Ming Jianhua; Yu Yuelun
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20190423