CN109670041A - A noisy illegal short-text recognition method based on a dual-channel text convolutional neural network - Google Patents

A noisy illegal short-text recognition method based on a dual-channel text convolutional neural network Download PDF

Info

Publication number
CN109670041A
CN109670041A CN201811446969.6A CN201811446969A
Authority
CN
China
Prior art keywords
character
text
convolutional neural
neural networks
dual channel
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811446969.6A
Other languages
Chinese (zh)
Inventor
周建政
姚金良
黄金海
明建华
俞月伦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tiange Technology (Hangzhou) Co Ltd
Original Assignee
Tiange Technology (Hangzhou) Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tiange Technology (Hangzhou) Co Ltd filed Critical Tiange Technology (Hangzhou) Co Ltd
Priority to CN201811446969.6A priority Critical patent/CN109670041A/en
Publication of CN109670041A publication Critical patent/CN109670041A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/191 Automatic line break hyphenation

Abstract

The present invention relates to a noisy illegal short-text recognition method based on a dual-channel text convolutional neural network. The method comprises preprocessing of noisy short texts, construction of a dual-channel text convolutional neural network model, and training of the model together with real-time recognition. Preprocessing normalizes noise characters, eliminating their influence and improving the learning ability of the convolutional neural network model. The dual-channel model is a text convolutional neural network that simultaneously takes the preprocessed character sequence and its pinyin sequence as input. Because the pinyin sequence adds input and modeling capacity, the model can eliminate the effect of homophone substitution on classification performance. The method handles homophone substitution, substitution of visually similar English characters, substitution of semantically equivalent digit characters, and similar noise; experimental results show that the method achieves high recognition accuracy and a low false-detection rate on noisy illegal short texts.

Description

A noisy illegal short-text recognition method based on a dual-channel text convolutional neural network
Technical field
The invention belongs to the field of computer natural language processing and relates to a noisy illegal short-text recognition method based on a dual-channel text convolutional neural network.
Background technique
With the rapid development of the Internet, sharing information and opinions and communicating over the network has become an important mode of network use: discussing problems on BBS forums; publishing opinions, news, and comments on microblogs; exchanging messages through instant-messaging tools; commenting on the review pages of news websites; interacting through live-streaming services; and posting bullet-screen comments on currently popular video platforms. This user-generated-content model makes information sharing and exchange convenient, but this way of publishing Internet content is also easily exploited by criminals to publish illegal advertising. Pornographic advertising is currently the main type of illegal harmful information; it is generally published as short texts that direct users to pornographic websites or to QQ and WeChat accounts offering sexual services. To prevent the spread of such information, websites and applications need dedicated server-side programs that automatically examine user-submitted content and decide whether it is harmful; if it is, publication is blocked and the relevant functions of the offending account are closed.
The most common harmful-information identification and filtering method today is keyword filtering. This method requires a pre-built list of illegal keywords; submitted text is searched for the listed words, and a match marks the content as illegal. The method is efficient but has a high misclassification rate, since it wrongly flags normal text that merely contains a keyword. To address this problem, text-classification methods have been proposed. These generally represent the input text in a vector space model, build feature vectors from the characters or words that occur, weight them with TF*IDF to express the importance of each word or character feature, and then classify the feature vectors with statistical machine learning. Common classifiers include support vector machines, AdaBoost, neural networks, and decision trees. Such methods reduce the false-detection rate of text identification to some extent, but because a short text contains only a few characters and the context between characters is ignored, their recognition accuracy still falls short of practical requirements.
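The keyword-filtering baseline described above can be sketched in a few lines; the keyword list here is a hypothetical illustration, not the patent's list:

```python
# Minimal sketch of the keyword-filtering baseline.  The keyword list is
# a hypothetical example used only for illustration.
ILLEGAL_KEYWORDS = {"casino", "porn", "invoice-sale"}

def is_illegal_by_keywords(text: str, keywords=ILLEGAL_KEYWORDS) -> bool:
    """Flag the text as illegal if any listed keyword occurs in it."""
    return any(kw in text for kw in keywords)

# Efficient, but with a high misclassification rate: any normal sentence
# that merely mentions a keyword is also flagged.
print(is_illegal_by_keywords("visit our casino tonight"))   # True
print(is_illegal_by_keywords("a perfectly normal message")) # False
```

This illustrates why substring matching alone cannot separate illegal advertising from ordinary text that happens to contain a listed word.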
Moreover, illegal users adapt the form of their published content to the identification system in order to evade it. The main evasion technique is to create variants of the keywords in the short text, for example writing the characters of '发票' (invoice) or '裸聊' (naked chat) with substitute characters. A survey of existing pornographic advertising reveals the main variant forms currently in use: (1) interspersed special symbols (usually punctuation-like, non-textual characters), e.g. "QQ296"161『7102"; (2) substitution of visually similar characters, e.g. replacing 日 with 曰; (3) homophone or near-homophone substitution, e.g. replacing 微 with 为; (4) pinyin substitution, e.g. replacing 微信 (WeChat) with "wei xin"; (5) reversing the order of a keyword or of the whole sentence; (6) splitting a Chinese character into its radical and other components, e.g. splitting 裸 (naked) into a clothing-radical character plus 果; (7) writing keywords in traditional characters; (8) interspersing look-alike characters among English letters and digits, e.g. "a5m2coM"; (9) converting digit characters into sequence forms or Chinese numerals, e.g. writing a digit as the circled form ②.
To cope with keyword variants, keyword-expansion methods are commonly used: the possible variant forms of each keyword are generated from the keyword list and added to it as part of the list. In addition, Wen Yuanxu proposed a method of extracting variant features [Research on variant short-text filtering algorithms, 2012.12, Beijing University of Posts and Telecommunications, master's thesis]: rules are used to construct features expressing the keyword variants likely to occur, and a Bayes classifier then performs the identification. However, variant features extracted with hand-built rules are easily discovered by illegal users, who then evade the system with newly invented variant forms; and constructing variant features by hand is itself a difficult job.
Given that current methods struggle to handle keyword variants effectively and that traditional short-text classification methods are not accurate enough, the present method uses deep-learning techniques to discover the variant features that may occur in the samples and to mine the correlations between characters, improving the classification accuracy for noisy short texts. The method exploits the strong learning ability of deep learning and is easy to update from new samples, so newly emerging variant forms can be handled quickly. It can be deployed at the server side of websites and Internet applications to automatically identify the noisy short texts submitted by users and prevent the spread of harmful information.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a noisy illegal short-text recognition method based on a dual-channel text convolutional neural network. The method has three main parts: preprocessing of noisy short texts, construction of the dual-channel text convolutional neural network model, and training of the model together with real-time recognition. Preprocessing reduces the influence of noise characters on subsequent model training and recognition. The dual-channel model is a neural network designed specifically for the homophone-substitution type of noise. Training and real-time recognition use standard deep-learning techniques.
(1) Preprocessing of noisy short texts comprises the following treatment processes: digit normalization, English-letter normalization, traditional-to-simplified Chinese conversion, special-meaning symbol handling, removal of interspersed noise symbols, unified representation of consecutive digits, sequence segmentation, and Chinese-to-pinyin conversion. Digit normalization and English-letter normalization convert digits and letters that are visually similar or semantically equivalent but differently encoded into half-width digits and lowercase English letters. Digit normalization builds a digit correspondence table mapping every Unicode symbol with numeric meaning, in all its encodings, to the standard digit codes; e.g. ⑦, 〇, and ② are converted to the standard digits 7, 0, and 2. English-letter normalization converts all differently encoded letters of similar shape to standard lowercase letters, e.g. various look-alike glyphs to the standard forms 'a', 'f', 'k'; it is implemented with a dictionary-structured lookup table. Traditional-to-simplified conversion maps any traditional Chinese characters in the sequence to their simplified forms, again via a correspondence table. Special-meaning symbol handling converts symbols that carry a particular meaning to characters of the corresponding meaning; e.g. symbols such as ╁ that resemble '+' are converted to the Chinese character 加 ('add'). The special-meaning symbols that may occur are found by corpus analysis, and a lookup table implements the conversion. Removal of interspersed noise symbols filters out of the converted short text every symbol that is not a Chinese character, an English letter, or a digit; e.g. the 'ˇ' marks are removed from '扣扣251ˇ764ˇ5947'. Unified digit-run representation replaces each run of consecutive digits with the token <num_n>, where n is the number of consecutive digits; e.g. '扣扣2517645947' becomes '扣扣<num_10>'. This representation reduces the sparsity of numeric samples. Sequence segmentation then splits the converted short text: each Chinese character is an individual token, a run of English letters is one unit, and each <num_n> token is one unit; e.g. 'jia扣扣<num_10>' is segmented into 'jia | 扣 | 扣 | <num_10>'. Finally, Chinese-to-pinyin conversion replaces each Chinese character, on top of the segmentation, with its corresponding toneless pinyin; e.g. 'jia扣扣<num_10>' becomes 'jia kou kou <num_10>'. This conversion is likewise implemented with a dictionary.
These preprocessing steps eliminate the influence of most noise characters, reduce the number of samples needed to train the deep-learning model, and give the method a degree of adaptability, at recognition time, to noise characters that did not occur in training.
(2) Construction of the dual-channel text convolutional neural network model. Preprocessing eliminates the influence of most noise characters, but it cannot eliminate the difficult case of homophone substitution of Chinese characters. The method therefore creates a text convolutional neural network model that simultaneously takes the preprocessed character sequence and pinyin sequence as input, guaranteeing that the model can remove the effect of homophone substitution on classification performance. The dual-channel model structure is shown in Fig. 3. The network consists of two text convolutional neural network models: one takes the character sequence as input, the other the pinyin sequence. The network thus captures character and pinyin information at the same time, and the noisy text produced by homophone substitution is resolved through the pinyin sequence. Each sub-network begins with an embedding layer that converts characters (words or pinyin) into vector representations; the sentence's embedding matrix is then convolved at each convolution scale, one kernel producing several convolution outputs; a nonlinear activation function is applied to all convolution outputs; max pooling over the results then yields one value per filter (single convolution kernel). In the method, the dual-channel network is divided into a character text convolutional network and a pinyin text convolutional network, which may use different embedding lengths, vocabularies, and convolution scales. Finally, the max-pooled feature values of the two networks are concatenated and fed through a fully connected layer into softmax for classification.
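As a rough shape-level sketch of the channel structure just described (random weights, NumPy only; the actual model is built and trained in a deep-learning framework, and all sizes here are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

def text_cnn_channel(token_ids, vocab_size, embed_dim=8, filter_size=3, num_filters=4):
    """One channel: embedding -> 1-D convolution -> ReLU -> max-over-time pooling.

    A kernel of width filter_size over a length-L sequence yields
    L - filter_size + 1 convolution outputs; pooling keeps one value per filter.
    """
    E = rng.normal(size=(vocab_size, embed_dim))          # embedding table
    W = rng.normal(size=(num_filters, filter_size, embed_dim))
    x = E[token_ids]                                      # (L, embed_dim)
    L = len(token_ids)
    conv = np.stack([
        [np.sum(W[f] * x[i:i + filter_size]) for i in range(L - filter_size + 1)]
        for f in range(num_filters)
    ])                                                    # (num_filters, L-filter_size+1)
    relu = np.maximum(conv, 0.0)                          # nonlinear activation
    return relu.max(axis=1)                               # max pooling: one value per filter

# Dual channel: character ids and pinyin ids; pooled features are
# concatenated, then a fully connected layer feeds softmax over
# {normal, illegal}.
char_feat = text_cnn_channel([3, 1, 4, 1, 5, 9], vocab_size=20)
pin_feat = text_cnn_channel([2, 7, 1, 8, 2, 8], vocab_size=30)
features = np.concatenate([char_feat, pin_feat])          # (8,)
W_fc = rng.normal(size=(features.size, 2))
logits = features @ W_fc
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(features.shape, probs.shape)
```

The sketch only demonstrates how the two pooled feature vectors are concatenated before the final classification layer; weights are untrained random values.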
(3) Training of the dual-channel text convolutional neural network model and real-time recognition. Training the dual-channel network requires building a sample database. Samples fall into two classes, positive and negative, representing illegal short texts and normal short texts respectively. Each sample is preprocessed into a character sequence, and the character sequence is also converted into a pinyin sequence (digit characters and English characters are left unconverted). During training, the character sequence and the corresponding pinyin sequence are fed to the corresponding input channels of the dual-channel model, together with the sample label: 0 for normal, 1 for illegal. The loss function used when training this text convolutional neural network model is:

Loss = tf.reduce_mean(loss1) + lambda * l2_loss

where l2_loss is a parameter regularization term added to prevent parameter overfitting; it acts on the weights of the fully connected layer before softmax. loss1 is the cross-entropy loss: the outputs of the fully connected layer (one value per class) are first passed through the softmax function, converting them into the probability of belonging to each class, and the cross-entropy is then taken between the softmax output and the true sample label (class). The tf.reduce_mean function averages the cross-entropy of loss1 over one batch. The loss function of this method therefore consists of the cross-entropy term and the weight-regularization term, with lambda the weight between the two. After the loss function is set, an optimization method is chosen and the optimal parameters are computed by gradient descent; this method uses the Adam optimizer. For training, all samples are first preprocessed into character sequences and pinyin sequences; vocabularies for characters and for pinyin are then built from the training samples; using the respective vocabularies, the characters and pinyin in the training samples are converted into integer ids (in essence, indices into the vocabulary); finally, together with the sample labels, the sequences are fed to the character channel and the pinyin channel of the dual-channel model. Once the training data are ready, batches are iterated continuously and the parameters are updated by the gradient of the loss function, realizing the training of the model.
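The loss above (batch-mean cross-entropy plus an L2 regularizer on the final fully connected weights, weighted by lambda) can be written out in NumPy as a sketch; the toy logits, labels, and weights are illustrative assumptions:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def total_loss(logits, labels, W_fc, lam):
    """Batch-mean cross-entropy + lam * l2_loss on the final dense weights.

    l2_loss follows TensorFlow's convention: sum(w**2) / 2.
    """
    probs = softmax(logits)
    ce = -np.log(probs[np.arange(len(labels)), labels])   # per-sample cross-entropy
    l2 = np.sum(W_fc ** 2) / 2.0
    # mirrors Loss = tf.reduce_mean(loss1) + lambda * l2_loss
    return ce.mean() + lam * l2

# Uniform logits give a cross-entropy of ln(2) per sample for 2 classes.
logits = np.zeros((4, 2))
labels = np.array([0, 1, 0, 1])
W_fc = np.array([[1.0, 0.0], [0.0, 1.0]])
loss = total_loss(logits, labels, W_fc, lam=0.01)
print(round(loss, 4))  # ln(2) + 0.01 * 1.0
```

With uniform logits the cross-entropy term is ln(2) and the regularizer contributes lam times half the squared weight norm, so the trade-off controlled by lambda is visible directly.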
Once the parameters of the dual-channel text convolutional neural network model have been trained, the method can perform real-time recognition of noisy short texts. First, the model, its parameters, and the vocabularies built during training are imported; then the noisy short text under test is preprocessed to obtain its character sequence and pinyin sequence; the two sequences are fed into the character-sequence channel and the pinyin-sequence channel of the dual-channel model; finally the model computes the softmax values, from which the decided class is obtained.
Compared with the prior art, the present invention has the following advantages:
The invention can be used for the recognition of noisy illegal short texts, improving the accuracy and robustness of text-classification methods on illegal short texts with various kinds of man-made noise; in particular, the method effectively identifies illegal short texts with homophone substitutions, which currently cannot be handled effectively. Experiments demonstrate the effectiveness of the dual-channel text convolutional neural network.
Description of the drawings
Fig. 1 shows the flow chart of the invention;
Fig. 2 shows the preprocessing flow chart of the method;
Fig. 3 shows the dual-channel text convolutional neural network model of the method;
Fig. 4 shows how accuracy changes during training of the dual-channel text convolutional neural network; the dark line in the figure is the smoothed version of the light line;
Fig. 5 shows how the loss changes during training of the dual-channel text convolutional neural network; the dark line in the figure is the smoothed version of the light line.
Detailed description of the embodiments
The present invention is described in detail below with reference to the accompanying drawings. It should be noted that the described embodiments are intended only to aid understanding of the invention and do not limit it in any way.
The method of the present invention is not limited to processing pornographic text; other, similar illegal advertising can be handled just as effectively, e.g. the various invoice-selling advertisement texts; it suffices to collect samples of the relevant class and learn the corresponding recognizer. In this embodiment, the main objects to be processed are pornographic promotion texts, i.e. the various pornographic advertisement texts published by illegal users on network platforms; most of these have noise added in order to break through existing illegal-text detection systems. The embodiment is implemented with the deep-learning framework TensorFlow, but the model could equally be built and trained with other deep-learning frameworks.
The embodiments of the present invention are further described below with reference to the accompanying drawings.
Fig. 1 is a flow diagram showing the parts of the invention and their relationships. A noisy illegal short-text recognition method based on a dual-channel text convolutional neural network comprises the following parts: preprocessing of noisy short texts, construction of the dual-channel text convolutional neural network model, training of the dual-channel model, and recognition based on the trained model. Preprocessing unifies the various characters of a noisy short text into standard characters, removes noise, and performs segmentation. Model construction mainly creates a text convolutional neural network adapted to recognizing noisy text. Model training obtains the parameters of the network from the constructed model, the defined loss function, the sample data, and the training settings. Recognition classifies input noisy short texts in real time according to the trained network parameters. The method as a whole involves two basic processes: model training and real-time recognition. For training, the dual-channel text convolutional neural network model is first designed; the training samples are then preprocessed and prepared as training data; the model is trained with the chosen training parameters, and saved once the training requirements are met. For real-time recognition, the input noisy short text is preprocessed and then fed into the dual-channel model, whose real-time output is the recognition result. The implementation of each part is described in detail below.
(1) Preprocessing
The preprocessing step processes the input noisy short text through the treatment processes shown in Fig. 2; the result can be used directly for training the dual-channel model or for real-time recognition. The goal of preprocessing is to reduce the influence of noise. Noise symbols could in principle be kept as vocabulary entries for training the text convolutional network, but given the diversity of noise-symbol insertion and the sparsity of the training data, removing these noise symbols in preprocessing copes better with noise produced by the various symbol substitutions. The preprocessing steps of the method are: digit normalization, English-letter normalization, traditional-to-simplified Chinese conversion, special-meaning symbol handling, removal of interspersed noise symbols, unified representation of consecutive digits, sequence segmentation, and Chinese-to-pinyin conversion.
Digit normalization converts every Unicode symbol with numeric meaning to the standard digit codes; e.g. ⑦, 〇, and ② are converted to the standard digits 7, 0, and 2. To implement this conversion, all symbols with numeric meaning in the Unicode tables were examined and code-conversion rules were built from their ordering. The circled digits ① through ⑨, for instance, are contiguously ordered in the Unicode table, so for them the conversion is realized by the formula ch_out = chr(ord(ch) - ord("①") + ord("1")), where the ord function returns a character's Unicode code point and the chr function returns the character for a code point. Similarly, for ⑩ through ⑲, ch_out = '1' + chr(ord(ch) - ord("⑩") + ord("0")); ⑳ is special-cased and directly mapped to "20". In the same way, the digit normalization of the method handles the other contiguous ranges, such as the full-width digits, the parenthesized and dotted digit series, and the Roman numerals Ⅰ through Ⅻ; for the dotted series the dot symbol is appended to the converted result, and for the day-symbol series the character 日 is appended. For characters drawn from enumeration strings of digit glyphs or Chinese numerals, the position of the input character in the string is found with the index function and the position integer is converted to a character with the str function. Symbols without numeric meaning are returned unchanged by digit normalization.
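The code-point arithmetic described above can be sketched directly; only a few of the ranges the patent handles are shown, and the full-width-digit branch is added as a further example of the same technique:

```python
# Sketch of the digit-normalization step.  Circled digits and full-width
# digits are contiguous in Unicode, so both ranges map to ASCII digits by
# code-point arithmetic, as in the patent's formula
# ch_out = chr(ord(ch) - ord("①") + ord("1")).
def normalize_digit(ch: str) -> str:
    if "①" <= ch <= "⑨":                       # circled one .. circled nine
        return chr(ord(ch) - ord("①") + ord("1"))
    if "⑩" <= ch <= "⑲":                       # circled ten .. nineteen -> two digits
        return "1" + chr(ord(ch) - ord("⑩") + ord("0"))
    if ch == "⑳":                               # circled twenty, special-cased
        return "20"
    if "０" <= ch <= "９":                       # full-width digits
        return chr(ord(ch) - ord("０") + ord("0"))
    return ch                                    # non-numeric symbols pass through

print("".join(normalize_digit(c) for c in "②５1⑩"))  # -> 25110
```

Each additional numeric series (parenthesized digits, Roman numerals, Chinese numerals) would add another branch of the same form.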
English-letter normalization converts all differently encoded letters with similar shapes into standard lowercase letters, e.g. various look-alike glyphs into the standard forms 'a', 'f', 'k'. Because the look-alike letters are not encoded contiguously in the Unicode table, the conversion is implemented with a dictionary-structured lookup table. For this purpose, a file stores the shape-alike letters, one line per letter: the standard letter first, then ':', then its Unicode look-alikes separated by spaces. Such shape-similarity lines are built for all 26 English letters. From this file the method creates a dictionary whose keys are the look-alike characters and whose values are the standard English letters; the conversion of a character is realized by looking it up in this dictionary. Uppercase letters are handled with the ordinary uppercase-to-lowercase function. Non-letter characters are output unchanged.
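The one-line-per-letter table format and the dictionary built from it can be sketched as follows; the look-alike variants listed here are illustrative stand-ins for the patent's curated file:

```python
# Look-alike-letter table: one line per standard letter, then ":",
# then its look-alike variants separated by spaces.  The variant
# glyphs below are illustrative examples, not the patent's full table.
TABLE_LINES = [
    "a: ɑ ａ а",   # Latin alpha, full-width a, Cyrillic a
    "o: ο о ｏ",   # Greek omicron, Cyrillic o, full-width o
]

def build_homoglyph_map(lines):
    mapping = {}
    for line in lines:
        std, _, variants = line.partition(":")
        std = std.strip()
        for v in variants.split():
            mapping[v] = std      # key: look-alike, value: standard letter
    return mapping

HOMOGLYPHS = build_homoglyph_map(TABLE_LINES)

def normalize_letter(ch: str) -> str:
    ch = HOMOGLYPHS.get(ch, ch)
    return ch.lower() if ch.isalpha() else ch  # uppercase via ordinary lowering

print("".join(normalize_letter(c) for c in "ɑXо"))  # -> axo
```

The real table would carry lines for all 26 letters; the lookup itself stays a single dictionary access per character.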
Traditional-to-simplified conversion maps any traditional Chinese characters in the sequence to their simplified forms. It is implemented by building a correspondence table of all Chinese characters whose simplified and traditional forms differ in the Unicode tables, and likewise uses a dictionary structure for fast conversion. The implementation uses the Python package zhtools, which provides the simplified/traditional correspondence.
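The dictionary-based conversion can be sketched with a handful of entries; the patent's implementation uses zhtools, which ships the full table, so the entries below are merely illustrative:

```python
# Dictionary-based traditional -> simplified conversion, sketched with a
# few illustrative entries; the full correspondence table comes from the
# zhtools package in the patent's implementation.
TRAD_TO_SIMP = {"發": "发", "會": "会", "體": "体", "學": "学"}

def to_simplified(text: str) -> str:
    return "".join(TRAD_TO_SIMP.get(ch, ch) for ch in text)

print(to_simplified("學會x"))  # -> 学会x
```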
Special-meaning symbol handling converts symbols that carry a particular meaning into characters of the corresponding meaning, e.g. symbols such as ╁ that resemble '+' are converted to the Chinese character 加 ('add'). These special-meaning characters represent specific and important semantic information: a common noise-addition technique is precisely to replace a semantically significant Chinese character with such a shape-alike symbol. The conversion is implemented in the same way as for the shape-alike English letters, by building a lookup line such as 加: + ＋ ╂ ╃ ╁ ╀ ┿ ┾ ┽ ┼ ╋ and realizing the conversion through a dictionary. The special-meaning symbols that may occur are found by analyzing the corpus, and the lookup table implements the conversion of special-meaning characters.
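Reusing the same one-line table format, the special-meaning conversion can be sketched like this; the 加 example comes from the patent, while the exact variant set on the line is illustrative:

```python
# Special-meaning symbols: look-alikes of "+" map to the Chinese
# character 加 ("add"), using the same one-line table format as the
# letter table.  The variant set on this line is illustrative.
SPECIAL_LINE = "加: + ＋ ╁ ╀ ┿ ┼"

def build_special_map(line):
    target, _, variants = line.partition(":")
    return {v: target.strip() for v in variants.split()}

SPECIAL = build_special_map(SPECIAL_LINE)

def replace_special(text: str) -> str:
    return "".join(SPECIAL.get(ch, ch) for ch in text)

print(replace_special("╁"))  # -> 加
```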
Removal of interspersed noise symbols filters out of the short text obtained after the character conversions above every symbol that is not a Chinese character, an English letter, or a digit; e.g. the 'ˇ' marks are removed from '扣扣251ˇ764ˇ5947'. After the characters have been normalized, this operation is easy to implement: all such non-Chinese, non-English, non-digit symbols are simply deleted.
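After normalization this step reduces to a single character-class filter; a minimal sketch (the CJK range used here is the common basic block, an implementation assumption):

```python
import re

# Keep only Chinese characters (basic CJK block, an assumption of this
# sketch), English letters, and digits; drop everything else.
NOISE = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]")

def strip_noise(text: str) -> str:
    return NOISE.sub("", text)

print(strip_noise("扣扣251ˇ764ˇ5947"))  # -> 扣扣2517645947
```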
Continuous-digit representation rewrites each run of consecutive numeric characters in the short text processed above as the form "<num_n>", where n indicates the number of consecutive digits. For example, "扣扣2517645947" is represented as "扣扣<num_10>". The purpose of this representation is to eliminate the diversity of numeric values and the sparsity of the training data. For example, a QQ number generally has more than 7 digits, while quantity-related numbers generally have fewer than 7; the representation thus facilitates the feature extraction and training of the subsequent deep learning model and copes with QQ numbers that do not appear in the training samples.
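The rewrite can be expressed as a single regular-expression substitution:

```python
import re

# Replace each maximal run of digits with "<num_n>", where n is the run length.
def mark_numbers(text: str) -> str:
    return re.sub(r"\d+", lambda m: f"<num_{len(m.group())}>", text)
```

A 10-digit QQ number and any other 10-digit string thus collapse to the same token, which is exactly how the method generalizes to numbers unseen in training.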
Character sequence segmentation splits the short text after the character conversions above. In this method, each Chinese character is cut out as an individual token, each run of consecutive English characters forms one unit, and a continuous-digit marker "<num_n>" forms one unit. For example, "jia扣扣<num_10>" is split into "jia / 扣 / 扣 / <num_10>".
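The three token types can be captured by one ordered regular-expression alternation (a sketch; the patent does not state how its implementation performs the split):

```python
import re

# Tokenize: a "<num_n>" marker or a run of ASCII letters is one token;
# every other character (in practice a Chinese character) is its own token.
# Alternation order matters: the marker must be tried before the fallback ".".
TOKEN = re.compile(r"<num_\d+>|[A-Za-z]+|.")

def tokenize(text: str) -> list:
    return TOKEN.findall(text)
```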
Chinese-to-pinyin representation expresses each Chinese character, on the basis of the character sequence segmentation, as its corresponding toneless pinyin. For example, "jia 扣 扣 <num_10>" is converted into "jia kou kou <num_10>". The conversion from Chinese characters to pinyin is likewise realized through a dictionary. The implementation uses the pinyin Python package: calling its get method directly on a Chinese character obtains the pinyin of that character.
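The dictionary-based conversion can be sketched as below; the two-entry table is illustrative (the real mapping comes from the pinyin package), and non-Chinese tokens are passed through unchanged.

```python
# Sketch of character-to-pinyin conversion via dictionary lookup.
# PINYIN is an illustrative subset; the actual method uses the `pinyin`
# Python package to obtain toneless pinyin for every Chinese character.
PINYIN = {"扣": "kou", "加": "jia"}

def to_pinyin(tokens: list) -> list:
    # English runs and "<num_n>" markers have no pinyin and pass through.
    return [PINYIN.get(t, t) for t in tokens]
```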
(2) Construction of the dual-channel text convolutional neural network model
The preprocessing steps can eliminate the influence of most noise characters, but the difficult case of homophone substitution of Chinese characters is hard to eliminate. Because Chinese has many homophonous characters, training a text convolutional neural network model on the character sequence alone can hardly capture information about homophones that do not occur in the samples, so homophone substitutions unseen in training cannot be recognized effectively. This method therefore constructs a dual-channel text convolutional neural network model that simultaneously inputs the preprocessed character sequence and pinyin sequence, ensuring that the model copes with the influence of homophone character substitution. The structure of the dual-channel text convolutional neural network model is shown in Figure 3. The network consists of two text convolutional neural network models: the input of one is the character sequence, and the input of the other is the pinyin sequence. A single text convolutional network comprises: a word-embedding layer, which converts characters into vector representations; a convolutional layer, in which convolution is performed according to the convolution scale, one convolution kernel producing len(sequence)-filter_size+1 convolution results; a ReLU activation function applied to all convolution results for nonlinear processing; and max pooling, which yields one value per filter. All filter values are finally fed through a fully connected layer into softmax for classification. In this method, the two channels are called the character text convolutional neural network and the pinyin text convolutional neural network; the two networks may set different word-vector lengths, vocabularies, and convolution scales. After max pooling, the feature values obtained by the two text convolutional neural networks are concatenated and input through the fully connected layer into softmax. In the present embodiment, the word-embedding vector length of both the character model and the pinyin model is set to 128. The convolution scales of both channels are set to (3, 4, 5), i.e. kernel sizes of 3, 4, and 5, so that relations spanning 3, 4, or 5 characters can be captured. In addition, 128 filters are set for each convolution scale, so each channel has 3*128 filters and the two channels together have 2*3*128 filters. The nonlinear function is ReLU. The pooling layer uses max pooling, i.e. the maximum of the result vector returned by a filter is taken as that filter's pooled output. For the present embodiment, 2*3*128 feature values are obtained.
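The per-filter pipeline described above (convolve, activate, pool) can be shown in miniature in pure Python; shapes and weights here are illustrative, not the trained model.

```python
# One channel's feature extraction for a single filter: a 1-D convolution of
# width k over a sequence of d-dimensional embeddings yields len-k+1 values,
# each passed through ReLU, then max-pooled to one scalar per filter.

def conv_relu_maxpool(embedded, kernel):
    # embedded: list of d-dim vectors; kernel: k x d weight matrix.
    k = len(kernel)
    outputs = []
    for i in range(len(embedded) - k + 1):   # len(sequence)-filter_size+1 positions
        s = sum(kernel[j][m] * embedded[i + j][m]
                for j in range(k) for m in range(len(kernel[0])))
        outputs.append(max(0.0, s))          # ReLU activation
    return max(outputs)                      # max pooling over positions
```

With 128 filters per scale and scales (3, 4, 5) in each of the two channels, repeating this per filter yields the 2*3*128 feature values mentioned above.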
(3) Training of the dual-channel text convolutional neural network model
After the deep learning network model is built, the method needs to set a loss function, so that the model parameters optimal on the training sample set can be obtained by optimization. The loss function set when training this text convolutional neural network model is:
loss = tf.reduce_mean(loss1) + lambda * l2_loss
Here l2_loss is a parameter regularization term added to prevent overfitting; it acts on the weights of the fully connected layer before softmax. loss1 is the cross-entropy loss, realized in the present embodiment with the TensorFlow function softmax_cross_entropy_with_logits: it first applies the softmax operation to the output of the fully connected layer (one output value per class), converting the outputs into the probability of belonging to each class, and then takes the cross-entropy between the softmax output and the true sample label (class). The tf.reduce_mean function computes the average cross-entropy of a batch in loss1. The loss function of this method therefore comprises the cross-entropy term and the weight regularization term, with lambda as the weight between the two.
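The two terms of the loss can be written out explicitly; the sketch below mirrors what softmax_cross_entropy_with_logits plus the L2 penalty compute, in plain Python for clarity.

```python
import math

# loss = mean softmax cross-entropy over the batch + lambda * L2(weights).

def softmax(logits):
    exps = [math.exp(v - max(logits)) for v in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]

def loss(batch_logits, batch_labels, weights, lam):
    # batch_labels are class indices; cross-entropy = -log p(true class).
    ce = [-math.log(softmax(lg)[y]) for lg, y in zip(batch_logits, batch_labels)]
    l2 = sum(w * w for w in weights)         # regularizer on FC-layer weights
    return sum(ce) / len(ce) + lam * l2
```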
After setting the loss function, the method sets the optimization method, which computes the optimal parameters by gradient descent. This method uses the Adam optimizer. The Adam algorithm dynamically adjusts a per-parameter learning rate according to first- and second-moment estimates of the gradient of the loss function with respect to each parameter. The learning step of the Adam method stays within a bounded range, so a very large gradient on some sample cannot cause a very large learning step, and the parameter updates are more stable.
After the loss function and optimization method are determined, training samples must be prepared for parameter training. The training samples of the present embodiment come from a live video streaming platform, on which most pornographic advertisement messages are sent by advertising robots. Through users' posting behavior patterns, the platform obtained 130,000 pornographic advertisement short texts and 80,000 normal short texts, 210,000 samples in total, of which 90% serve as training samples and the remaining 10% as test samples. At training time, this method converts all samples by preprocessing into character sequences and pinyin sequences. Then the character vocabulary and the pinyin vocabulary are built respectively from the training samples, and the characters in the training samples are converted via their respective vocabularies into numeric ids (essentially vocabulary subscripts). In the present embodiment this is realized with tensorflow.contrib.learn.preprocessing.VocabularyProcessor, and the created vocabulary is saved.
Training parameters need to be set before the deep learning model training starts. In the present embodiment, batch_size is set to 32, i.e. 32 samples are input at a time to compute the average gradient and update the model parameters. num_epoches is set to 3, i.e. all training samples are trained over three times, with a random shuffle of the training samples before each pass; experiments show the model has fully converged with num_epoches set to 3. checkpoint_every is set to 1000, i.e. a model is saved after every 1000 batches of training. num_checkpoints is set to 3, i.e. at most three models are saved during training.
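The epoch/batch loop implied by these parameters can be sketched as a generator (function and seed names here are illustrative):

```python
import random

# Each epoch reshuffles the sample order, then yields consecutive slices of
# batch_size samples; the final batch of an epoch may be smaller.

def batches(samples, batch_size=32, num_epochs=3, seed=0):
    rng = random.Random(seed)
    for _ in range(num_epochs):
        order = samples[:]
        rng.shuffle(order)                   # the "shuffle" before each pass
        for i in range(0, len(order), batch_size):
            yield order[i:i + batch_size]
```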
After the training parameters above are set and the training samples prepared, training can begin. In the present embodiment, the change in model accuracy during training is shown in Figure 4, whose abscissa is the training batch number. It can be seen from the figure that after 10,000 batches of training the accuracy of the model reaches an average of 0.94. The change in the model's loss during training is shown in Figure 5.
(4) Recognition based on the dual-channel text convolutional neural network model
Once the parameters of the dual-channel text convolutional neural network model are trained, the method can recognize noisy short texts in real time through the model. First, the model and its parameters must be imported, together with the vocabulary built during training. In the present embodiment, the model and model parameters are imported using the restore method of TensorFlow's tf.train.import_meta_graph class, and the vocabulary is imported using TensorFlow's VocabularyProcessor.restore function. Then the noisy short text to be tested is preprocessed to obtain the character sequence and the pinyin sequence, and the two sequences are input respectively into the corresponding character sequence channel and pinyin sequence channel of the dual-channel model. Finally the model computes the softmax values, from which the decided class is obtained.
In the present embodiment, the method was tested. On the original test sample set, with the test samples preprocessed (noise removed), an existing text convolutional network model obtains an accuracy of 0.97, while the dual-channel convolutional neural network model of this method obtains 0.98. The key improvement is that the false detection rate, i.e. normal short texts recognized as illegal short texts, falls from 0.019 to 0.016. Since in real scenes the great majority of samples are normal and illegal short texts are a minority, the reduction of the false detection rate reduces the number of texts falsely reported as illegal and enhances the practicality of the method.

Claims (7)

  1. A noisy illegal short text recognition method based on a dual-channel text convolutional neural network, characterized by comprising the following steps:
    1) preprocessing of the noisy short text;
    the step 1) comprises numeric character normalization, English character normalization, traditional-to-simplified Chinese conversion, special-meaning symbol processing, removal of interspersed noise symbols, unified representation of continuous numeric characters, character sequence segmentation, and Chinese-to-pinyin representation;
    2) construction of the dual-channel text convolutional neural network model;
    the step 2) is specifically to create a text convolutional neural network model that simultaneously inputs the preprocessed character sequence and pinyin sequence, for eliminating the influence of homophone character substitution on classification performance;
    3) training and real-time recognition of the dual-channel text convolutional neural network model; wherein the training process realizes parameter optimization through samples, and the real-time recognition process inputs a short text into the model for classification.
  2. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the numeric character normalization converts numeric characters that are similar in shape or identical in meaning but different in character encoding into half-width digits; the English character normalization converts English characters that are similar in shape or identical in meaning but different in character encoding into lowercase English characters.
  3. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the removal of interspersed noise symbols filters out, from the short text processed in the previous steps, all symbols that are not Chinese characters, English characters, or numeric characters.
  4. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the unified representation of continuous numeric characters represents each run of consecutive numeric characters in the short text as the form "<num_n>" according to the number of digits, where n indicates the number of consecutive numeric characters, for eliminating the sparsity problem of digit strings.
  5. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the dual-channel text convolutional neural network model in the step 2) comprises a character text convolutional neural network and a pinyin text convolutional neural network; the input of one is the character sequence, and the input of the other is the pinyin sequence;
    step 2) specifically: constructing a word-embedding layer for converting characters or pinyin into word vectors; then convolving the word-vector representation of the sentence according to the convolution scale, one convolution kernel yielding several convolution results; then applying a nonlinear activation function to all convolution results for nonlinear processing; finally, after max pooling, the feature values obtained by the two text convolutional neural networks are concatenated and input through a fully connected layer into softmax for classification;
    wherein the two text convolutional neural networks may set different word-vector lengths, vocabularies, and convolution scales.
  6. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the training process in the step 3) is specifically:
    constructing a sample database in which the samples are divided into positive and negative classes, respectively representing illegal short texts and normal short texts; each sample is converted by preprocessing into a character sequence, and the character sequence is further converted into a pinyin sequence; at training time, the character sequence and the corresponding pinyin sequence are input respectively into the corresponding input items of the dual-channel text convolutional neural network model, together with the corresponding sample label, 0 indicating normal and 1 indicating illegal;
    the loss function set when training the dual-channel text convolutional neural network model is:
    loss = tf.reduce_mean(loss1) + lambda * l2_loss;
    wherein
    l2_loss is a parameter regularization term added to prevent parameter overfitting, and loss1 is the cross-entropy loss function; a softmax operation is first applied to the output of the fully connected layer, to convert the output into the probability of belonging to each class; then the cross-entropy between the softmax output and the true sample label is taken; the tf.reduce_mean function computes the average cross-entropy of a batch in loss1; lambda is the weight;
    at training time, all samples are converted by preprocessing into character sequences and pinyin sequences; then the character vocabulary and the pinyin vocabulary are built respectively from the training samples; the characters in the training samples are converted via their respective vocabularies into numeric ids; these are then input, together with the sample labels, into the character sequence channel and the pinyin sequence channel of the dual-channel text convolutional neural network model; the data of one batch is prepared and iterated continuously, and after each batch of training data the parameters are updated through the gradient of the loss function, until the iteration stopping condition is reached.
  7. The noisy illegal short text recognition method based on a dual-channel text convolutional neural network according to claim 1, characterized in that the recognition process in the step 3) is specifically:
    first, importing the trained model and model parameters, together with the vocabulary built during training; the model and model parameters are imported using the restore method of TensorFlow's tf.train.import_meta_graph class, and the vocabulary is imported using TensorFlow's VocabularyProcessor.restore function;
    then, the noisy short text to be tested is preprocessed to obtain the character sequence and the pinyin sequence, and the two sequences are input respectively into the corresponding character sequence channel and pinyin sequence channel of the dual-channel text convolutional neural network model; finally the model computes the softmax value, and the decided class is obtained.
CN201811446969.6A 2018-11-29 2018-11-29 A noisy illegal short text recognition method based on a dual-channel text convolutional neural network Pending CN109670041A (en)

Publications (1)

Publication Number Publication Date
CN109670041A true CN109670041A (en) 2019-04-23



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information (inventors after change: Zhou Jianzheng, Yao Jinliang, Ming Jianhua, Huang Jinhai; inventors before change: Zhou Jianzheng, Yao Jinliang, Huang Jinhai, Ming Jianhua, Yu Yuelun)
RJ01 Rejection of invention patent application after publication (application publication date: 20190423)