CN110209795A

CN110209795A - Comment recognition method, device, computer-readable storage medium and computer equipment

Info

Publication number: CN110209795A
Application number: CN201810594832.9A
Authority: CN
Inventors: 刘智静; 康斌
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2018-06-11
Filing date: 2018-06-11
Publication date: 2019-09-06

Abstract

This application relates to a comment recognition method, device, computer-readable storage medium and computer equipment. The method comprises: obtaining a comment to be identified; when the comment to be identified contains no word from an illegal comment dictionary but contains a word from a suspected illegal comment dictionary, inputting the comment to be identified into a comment identification model; and outputting the comment type of the comment to be identified, the comment type being either legal comment or illegal comment. By combining a dictionary strategy with a model, this method can accurately filter out legal comments and avoid misjudging them, and can also accurately identify the type of comments that are suspected of being illegal, which greatly improves the accuracy of comment recognition.

Description

Comment recognition method, device, computer-readable storage medium and computer equipment

Technical field

This application relates to the field of computer technology, and in particular to a comment recognition method, device, computer-readable storage medium and computer equipment.

Background art

With the development of computer science and technology, people spend more and more time online. While using the internet, people can post their own comments on an event. For example, people can express their views on a news event by adding a comment.

As user comments multiply, the probability that illegal comments, such as abusive comments, appear among them also increases. Such comments affect platform quality and user experience. To address comment filtering, the conventional approach is therefore to build a specific sensitive-word list in advance, judge any comment that contains any word in the list to be an illegal comment, and filter it out.

However, because there are so many sensitive words, recognition based on a sensitive-word list easily misses comments that are in fact illegal. Moreover, because language is rich, a comment that contains a word from the sensitive-word list is not necessarily illegal, so the conventional approach is also prone to misjudging comments.

Summary of the invention

In view of the above technical problems, it is necessary to provide a comment recognition method, device, computer-readable storage medium and computer equipment that can improve the accuracy of comment recognition.

A comment recognition method, comprising:

obtaining a comment to be identified;

when the comment to be identified contains no word from an illegal comment dictionary but contains a word from a suspected illegal comment dictionary, inputting the comment to be identified into a comment identification model; and

outputting the comment type of the comment to be identified, the comment type being either legal comment or illegal comment.

A comment identification device, the device comprising:

a comment obtaining module, configured to obtain a comment to be identified;

a comment judgment module, configured to input the comment to be identified into a comment identification model when the comment to be identified contains no word from an illegal comment dictionary but contains a word from a suspected illegal comment dictionary; and

a comment identification module, configured to output the comment type of the comment to be identified, the comment type being either legal comment or illegal comment.

A computer equipment, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor performs the following steps when executing the computer program:

obtaining a comment to be identified;

A computer-readable storage medium storing a computer program, wherein the following steps are performed when the computer program is executed by a processor:

obtaining a comment to be identified;

In the above comment recognition method, the obtained comment to be identified is first checked for words from the illegal comment dictionary; if there are none, it is then checked for words from the suspected illegal comment dictionary. If the comment contains no word from the illegal comment dictionary but does contain a word from the suspected illegal comment dictionary, it needs further determination. The comment is then input into the comment identification model, which confirms whether a comment that has been found not to match the illegal comment dictionary, yet is suspected of being illegal, is a legal comment or an illegal comment. In this approach, the comment is first screened twice, by the illegal comment dictionary and by the suspected illegal comment dictionary, and the comment identification model then further identifies comments whose meaning may differ from one context to another. By combining a strategy with a model, the method can accurately filter out legal comments and avoid misjudging them, and can also accurately identify the type of comments that are suspected of being illegal, which greatly improves the accuracy of comment recognition.

Description of the drawings

Fig. 1 is a diagram of the application environment of the comment recognition method in one embodiment;

Fig. 2 is a flow diagram of the comment recognition method in one embodiment;

Fig. 3 is a flow diagram of the steps for generating the comment identification model in one embodiment;

Fig. 4 is a flow diagram of step 308 in one embodiment;

Fig. 5 is a schematic diagram of data processing in the comment identification model in one embodiment;

Fig. 6 is a flow diagram of the steps performed before step 304 in one embodiment;

Fig. 7 is a flow diagram of the comment recognition method in another embodiment;

Fig. 8 is a schematic diagram of the precision of three ways of processing comment sample data in one embodiment;

Fig. 9 is an overall diagram of the comment recognition method in one embodiment;

Fig. 10 is a structural block diagram of the comment identification device in one embodiment;

Fig. 11 is a structural block diagram of the comment identification device in another embodiment;

Fig. 12 is a structural block diagram of the training module 1008 in one embodiment;

Fig. 13 is a structural block diagram of the computer equipment in one embodiment.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of this application clearer, the application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the application and are not intended to limit it.

Fig. 1 is a diagram of the application environment of the comment recognition method in one embodiment. Referring to Fig. 1, the comment recognition method is applied to a comment identification system. The comment identification system includes a comment data server 110 and a comment identification server 120, which are connected through a network. Each of the comment data server 110 and the comment identification server 120 can be implemented as an independent server or as a server cluster composed of multiple servers. The comment data server 110 stores multiple user comments. When comment identification needs to be performed on the user comments in the comment data server 110, the comment identification model in the comment identification server 120 can identify the type of each user comment and determine whether it is a legal comment or an illegal comment.

As shown in Fig. 2, in one embodiment, a comment recognition method is provided. This embodiment is described mainly with the method applied to the comment identification server 120 in Fig. 1. Referring to Fig. 2, the comment recognition method specifically includes the following steps:

Step 202, obtain a comment to be identified.

A comment to be identified is a comment whose type needs to be identified, that is, a comment that must be determined to be either an illegal comment or a legal comment. Comments to be identified include, but are not limited to, comments entered by users on application terminals or web pages. For example, a comment to be identified can be a user's comment on a WeChat friend's Moments post, or a user's comment on a news item on a news web page. Any comment that requires type identification can serve as a comment to be identified. After the server obtains the comment to be identified, it can perform the type identification operation on it.

Step 204, when the comment to be identified contains no word from the illegal comment dictionary but contains a word from the suspected illegal comment dictionary, input the comment to be identified into the comment identification model.

The illegal comment dictionary contains words labelled in advance as illegal, that is, illegal words added in advance. An illegal word is a word that is regarded as illegal in any context. The suspected illegal comment dictionary contains words labelled in advance as suspected illegal, that is, suspected illegal words added in advance. A suspected illegal word is a word that is regarded as illegal in some contexts but not in all contexts. Such words are collected in advance and stored in the suspected illegal comment dictionary. The difference between the two is that a suspected illegal word must be judged according to its specific context: in a particular usage it may be an illegal word, or it may not. An illegal word, by contrast, is illegal regardless of context. For example, the word "rubbish" is a legal word in some contexts, as in "there is rubbish here that needs to be cleaned up", and an illegal word in others, as in "you piece of rubbish".

Word segmentation of the comment to be identified yields multiple words. When none of these words is in the illegal comment dictionary, the comment is not judged illegal on the basis of that dictionary. The comment is then checked further to detect whether any of its words is in the suspected illegal comment dictionary. If the comment contains no word from the illegal comment dictionary but contains a word from the suspected illegal comment dictionary, it is input into the comment identification model for further detection.

Step 206, output the comment type of the comment to be identified, the comment type being either legal comment or illegal comment.

A comment to be identified that contains no word from the illegal comment dictionary but contains a word from the suspected illegal comment dictionary is input into the comment identification model for further identification. The comment identification model identifies the input comment and determines whether it is a legal comment or an illegal comment. That is, the comments input into the comment identification model are those that do not match the illegal comment dictionary but are suspected of being illegal. Comments that can be determined to be illegal according to the illegal comment dictionary, or that are determined to be legal, do not need to be input into the comment identification model.
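As a non-limiting illustration, the routing of steps 202 to 206 might be sketched as follows. The function name, the parameter names and the string labels are hypothetical and do not appear in this application; the model is passed in as an arbitrary callable, and treating a comment that matches neither dictionary as legal is one reading of the embodiment.

```python
from typing import Callable, Iterable, List, Set


def classify_comment(
    words: Iterable[str],
    illegal_words: Set[str],
    suspected_words: Set[str],
    model: Callable[[List[str]], str],
) -> str:
    """Return "illegal" or "legal" for an already-segmented comment."""
    word_list = list(words)
    # Stage 1: a word from the illegal comment dictionary settles the case.
    if any(w in illegal_words for w in word_list):
        return "illegal"
    # Stage 2: context-dependent words are handed to the model.
    if any(w in suspected_words for w in word_list):
        return model(word_list)
    # Neither dictionary matched, so the model is never consulted.
    return "legal"
```

Only the middle branch invokes the model, which reflects the point made above that comments settled by the dictionaries need not be input into the comment identification model.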

In one embodiment, the illegal comment dictionary contains the illegal words whose hit accuracy exceeds a threshold, and the suspected illegal comment dictionary contains the illegal words whose hit accuracy does not exceed the threshold. The hit accuracy is the ratio of the number of illegal comments containing the illegal word to the number of comments containing the illegal word in the comment data.

In other words, the illegal words contained in the comment data are first determined: illegal word 1, illegal word 2, ..., illegal word N. For each illegal word, two quantities are counted: X1, the number of comments that contain the word and are determined to be illegal comments, and X2, the number of all comments that contain the word. The ratio of X1 to X2 is the hit accuracy of that illegal word.

In one embodiment, as shown in Fig. 3, the comment identification model is generated through the following steps:

Step 302, comment sample data is obtained.

Step 304, perform word segmentation on the comment sample data to obtain multiple words.

Step 306, input the multiple words contained in each piece of comment sample data into the comment identification model in order.

Step 308, train the comment identification model with the input words to obtain a trained comment identification model.

To improve the accuracy with which the comment identification model identifies comments, the model can be trained in advance, before it is put into actual use. A large amount of comment sample data is first obtained; the comment sample data can come from user comments in various applications or on various web pages. Word segmentation is performed on each piece of comment sample data to obtain multiple words, and the words are input into the comment identification model one after another, in the order in which they appear in the comment sample data, to train the model. That is, the data input into the comment identification model for each round of training is the set of words contained in one piece of comment sample data.

For example, if a piece of comment sample data is "I love Shenzhen University", word segmentation can be performed on this comment. The resulting words may be I / love / Shenzhen / University, or I / love / Shenzhen University. Which words a comment is split into depends on the segmentation mode. The splits above are by word; if the comment is split by character instead, the result is the six individual characters of the original Chinese sentence (我/爱/深/圳/大/学).

After training, a trained comment identification model is obtained. Training the comment identification model in advance on a large amount of comment sample data improves its accuracy in identifying comments to be identified.

In one embodiment, word segmentation of the comment sample data is performed in either of the following ways: splitting each piece of comment sample data character by character; or segmenting the comment sample data with a word segmentation tool.

After multiple pieces of comment sample data are obtained, each piece must be segmented into multiple words, which are converted into word vectors and then input into the comment identification model for training. There are two ways to segment each piece of comment sample data. The first is to split each piece character by character, which yields multiple characters; this can be called processing at character granularity. For example, if a piece of comment sample data is "I am in a very good mood today", splitting at character granularity yields the seven individual characters of the original Chinese sentence (我/今/天/心/情/很/好).

The second way is to input the comment sample data into a word segmentation tool, which splits it into multiple words. When a segmentation tool is used, the jieba library (a third-party library) can be used; it can scan out all the character sequences in each piece of comment sample data that can form words. For example, if a piece of comment sample data is "I love Shenzhen University", the segmentation tool may split it into I / love / Shenzhen University, or into I / love / Shenzhen / University. The exact split depends on the segmentation mode of jieba. jieba has a precise mode, which splits each piece of comment sample data as accurately as possible; a full mode, which scans out every character sequence in the comment sample data that can form a word; and a search engine mode, which further splits long words on the basis of the precise mode to improve recall.
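The two granularities can be illustrated with a small standard-library sketch. The word-level function below uses forward maximum matching against a tiny hypothetical vocabulary purely as a stand-in for a segmentation tool; it is not how jieba works internally, and both function names are assumptions made for illustration.

```python
from typing import List, Set


def split_by_character(comment: str) -> List[str]:
    """Character granularity: one token per non-space character."""
    return [ch for ch in comment if not ch.isspace()]


def split_by_vocabulary(comment: str, vocabulary: Set[str], max_len: int = 4) -> List[str]:
    """Word granularity via forward maximum matching (illustrative stand-in)."""
    words, i = [], 0
    while i < len(comment):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(comment) - i), 0, -1):
            piece = comment[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)
                i += size
                break
    return words
```

With the vocabulary {"深圳", "大学"} the sentence 我爱深圳大学 splits into four words, while a vocabulary containing "深圳大学" yields three, mirroring the two word-level splits described above.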

In one embodiment, step 306 comprises: obtaining the pinyin of each word, and inputting the pinyin of each word contained in each piece of comment sample data into the comment identification model in order.

In actual use, the data input into the comment identification model is a comment, whereas in training the data input into the model is the set of words obtained by splitting a comment. The words may be obtained by splitting the comment sample data character by character, or by performing word segmentation on it. In this embodiment, after the words of each piece of comment sample data are obtained, the pinyin of each word is obtained, and the pinyin of each word is input into the comment identification model for training, in the order in which the words appear in the sentence. This can be called processing at pinyin granularity: after each piece of comment sample data is split by character or segmented into words, the pinyin of the characters or words is used as the input data for training the comment identification model.

Using pinyin as input makes it possible to detect many homophones of illegal words. For example, a user comment may contain the word "paper", while the context shows that the user actually means "mentally deficient", which sounds similar in Chinese. In such a case, using pinyin avoids missing the illegal word. Users often substitute homophones to keep their comments from being blocked, so using pinyin as input increases the detection capability of the comment identification model and reduces the number of illegal words that go undetected.
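A minimal sketch of pinyin-granularity input follows. The four-entry lookup table is a hypothetical toy; a real system would need a complete pinyin lexicon. The word pair 纸张 ("paper") and 智障 ("mentally deficient") is an assumed reading of the homophone example above, and dropping tone marks is likewise an assumption.

```python
from typing import Dict, List

# Hypothetical toy table of toneless pinyin, for illustration only.
TOY_PINYIN: Dict[str, str] = {"纸": "zhi", "张": "zhang", "智": "zhi", "障": "zhang"}


def to_pinyin(words: List[str], table: Dict[str, str] = TOY_PINYIN) -> List[str]:
    """Replace each character by its pinyin; unknown characters pass through."""
    return ["".join(table.get(ch, ch) for ch in word) for word in words]
```

Under this toy table both words map to the same token, "zhizhang", so a model trained on pinyin input would see the substitute and the original as identical.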

In one embodiment, as shown in Fig. 4, step 308 comprises:

Step 402, convert the input words into word vectors of equal length through the word embedding layer of the comment identification model.

Word embedding is a type of word representation in which words with similar meanings have similar representations; it is the general term for methods that map words to real-valued vectors. The comment identification model has a word embedding layer that converts the input words into vectors. Specifically, the word embedding layer converts every word into a word vector of the same length. For example, if the input words are Beijing / welcomes / you, the word embedding layer can convert each of these three words into a 128-dimensional vector, so the three word vectors have equal length.
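As an illustration only, a word embedding layer can be pictured as a lookup table with one fixed-length row per vocabulary entry. The class below is a hypothetical sketch: the random initialisation, the reserved row for unknown words and all names are assumptions, and in a real model the table entries would be learned during training.

```python
import random
from typing import List, Sequence


class EmbeddingLayer:
    """Maps each word to a vector of the same length (a lookup table)."""

    def __init__(self, vocabulary: Sequence[str], dim: int = 128, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        # Row 0 is reserved for words that are not in the vocabulary.
        self.index = {word: i + 1 for i, word in enumerate(vocabulary)}
        self.table = [
            [rng.uniform(-0.1, 0.1) for _ in range(dim)]
            for _ in range(len(self.index) + 1)
        ]

    def __call__(self, words: Sequence[str]) -> List[List[float]]:
        return [self.table[self.index.get(w, 0)] for w in words]
```

Calling the layer on the three words of the example above returns three vectors of 128 values each, regardless of how long the words themselves are.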

Step 404, input the equal-length word vectors into the convolutional layer, adjust the parameters of the convolution kernels of the convolutional layer, and perform a convolution operation on the equal-length word vectors using the adjusted kernel parameters.

After the word embedding layer converts each input word into a word vector of equal length, the word vectors are input into the convolutional layer as its input data. Once data has been input into the convolutional layer, the parameters of the layer can be adjusted, so that the convolution operation is performed on the input data using the adjusted kernel parameters. The convolutional layer is used for feature extraction: after its convolution operation, multiple feature maps are obtained for the input words, each feature map being a feature extracted from the input words.

The convolutional layer contains convolution kernels, which are its weight parameters. When the layer extracts features from the input data, the size of the region over which features are extracted and the parameters used in the convolution operation are determined by the kernels. That is, adjusting the length and number of the convolution kernels amounts to adjusting the parameters of the convolutional layer. The length and number of the kernels can therefore be customised by the technician, and the width of a kernel can be set to the length of a word vector. Each convolution over several word vectors thus resembles the n-gram statistical language model in natural language processing, in that it takes into account the order of the input words in the sentence. During training, the technician can adjust the kernels according to each output of the output layer, thereby adjusting the feature extraction capability of the convolutional layer and the accuracy of the feature maps obtained after the convolution operation.
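The n-gram-like behaviour of a kernel whose width equals the word-vector length can be sketched in plain Python as follows. The function name and the optional bias term are assumptions; no activation function is applied because the text does not specify one.

```python
from typing import List, Sequence


def convolve(
    word_vectors: Sequence[Sequence[float]],
    kernel: Sequence[Sequence[float]],
    bias: float = 0.0,
) -> List[float]:
    """Slide a kernel of len(kernel) words over the sentence.

    Each kernel row is as wide as a word vector, so the kernel moves only
    along the word axis and yields one value per window of consecutive words.
    """
    length = len(kernel)
    feature_map = []
    for start in range(len(word_vectors) - length + 1):
        window = word_vectors[start:start + length]
        total = bias
        for row, vec in zip(kernel, window):
            total += sum(k * x for k, x in zip(row, vec))
        feature_map.append(total)
    return feature_map
```

A kernel of length 2 applied to a three-word sentence produces a feature map with two entries, one for each pair of adjacent words; a sentence shorter than the kernel produces an empty feature map in this sketch.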

Step 406, input the feature maps obtained by the convolution operation into the pooling layer for pooling, and input the maximum value of each feature map into the regression layer.

After the convolution operation of the convolutional layer produces multiple feature maps, the feature maps are input into the pooling layer as its input data. The pooling layer compresses the input feature maps: it can make the feature maps smaller, which reduces the computational complexity of the comment identification model, and it can compress the features in a feature map so as to extract its main features. That is, the pooling layer performs pooling on the input feature maps. After pooling, the pooling layer finds the maximum value of each feature map and outputs these maximum values to the regression layer as the regression layer's input data.
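Taking the maximum of each feature map, as described above, reduces to a one-line operation. The function name below is hypothetical.

```python
from typing import List, Sequence


def max_pool(feature_maps: Sequence[Sequence[float]]) -> List[float]:
    """Keep only the largest value of each feature map."""
    return [max(feature_map) for feature_map in feature_maps]
```

Because each feature map contributes exactly one value, the pooled vector has the same length for short and long comments, which is what allows a fixed-size regression layer to follow.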

Step 408, use the two probabilities output by the regression layer as the input data of the output layer.

Step 410, the output layer outputs the probability with the largest value, and the type identification result of the comment corresponding to that probability is determined according to a probability threshold.

After obtaining the maximum value of each feature map output by the pooling layer, the regression layer uses the maximum values of all the feature maps to obtain the probabilities that the feature maps correspond to a legal comment and to an illegal comment. The higher a probability, the more likely the comment belongs to the corresponding class. For example, if the regression layer obtains the probabilities legal 0.6 and illegal 0.4, its conclusion is that the input corresponding to the feature maps is legal with probability 0.6 and illegal with probability 0.4. The regression layer then outputs these two probabilities to the output layer; that is, the input data of the output layer is the two probabilities obtained by the regression layer.

The output layer examines the two input probabilities and takes the one with the highest value as its output data. Once the output data of the output layer is obtained, the probability it contains is compared with a preset probability threshold, and the type identification result of the corresponding comment is determined according to the threshold. For example, with the probability threshold preset to 0.5, when the probability output by the output layer is greater than 0.5, the comment corresponding to the input words is considered a legal comment; conversely, when the probability is less than 0.5, the comment is considered an illegal comment. The probability threshold can be set by the technician.
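Steps 408 and 410 might be sketched as below. The application does not state how the regression layer turns the pooled values into two probabilities; a softmax over two linear scores is assumed here because it yields two probabilities that sum to 1. Applying the threshold to the probability of the legal class is likewise one reading of the example above. All names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for stability
    total = sum(exps)
    return [e / total for e in exps]


def regression_layer(
    pooled: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> List[float]:
    """Return [p_legal, p_illegal]; weights holds one row per class."""
    scores = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


def decide(p_legal: float, threshold: float = 0.5) -> str:
    """Assumed reading: a legal probability above the threshold means legal."""
    return "legal" if p_legal > threshold else "illegal"
```

In this sketch a legal probability exactly equal to the threshold is treated as illegal; the text leaves that boundary case open.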

As shown in Fig. 5, a piece of comment sample data is "wait for the video and don't rent it". After word segmentation, the words are input in order into the word embedding layer of the comment identification model. The input data of the word embedding layer is wait / for / the / video / and / do / n't / rent / it; the word embedding layer converts these words into word vectors of the same length and outputs them to the convolutional layer. The convolutional layer extracts features from the word vectors using filters with several different window sizes. As shown in Fig. 5, a filter of the convolutional layer can extract features from the region covering the word vectors of "wait" and "for", and can also extract features from the region covering the word vectors of the three words "video", "and" and "do". The number of filters, that is, the number of convolution kernels, can be customised. Within the convolutional layer, the parameters of a given filter are shared, which reduces the number of parameters. In general, a filter can identify only one kind of feature; that is, each filter is a detector for one kind of feature.

After the convolutional layer extracts features from the word vectors, it outputs the corresponding feature maps. The number of feature maps depends on the number of convolution kernels; for example, if the number of kernels is set to 50, the convolutional layer outputs 50 feature maps. The input data of the pooling layer is the output data of the convolutional layer, namely the feature maps. The pooling layer pools each feature map and takes its maximum value. Taking the maximum value of a feature map means that the feature is independent of where it appears: wherever the feature occurs in the sentence, its strongest response is kept. After pooling, the maximum value of each feature map is obtained and input into the regression layer.

The regression layer uses the maximum values of all the feature maps to obtain the probabilities that the feature maps correspond to a legal comment and to an illegal comment. The higher a probability, the more likely the comment belongs to the corresponding class. For example, given the maximum values of 5 feature maps, the regression layer uses these 5 values to judge whether the comment corresponding to the feature maps is a legal comment or an illegal comment. The regression layer outputs two probabilities that sum to 1, one for each of the two possible classes. These two probabilities represent the regression layer's judgment: if it outputs a probability of 0.8 for legal comment and 0.2 for illegal comment, the regression layer judges that the comment has an 80% chance of being legal and a 20% chance of being illegal. The input data of the output layer is the output data of the regression layer; on receiving the probabilities of the two classes, the output layer takes the one with the largest value as its output data.

Actual use differs from training. In actual use, the input data of the comment identification model is a comment, that is, a complete sentence, and the output layer outputs a result indicating whether the input comment is a legal comment. In training, the input data is the set of words in a comment, and the output data is the probability that the comment is legal. This is because, during training, the parameters of the convolutional layer and the regression layer can be adjusted according to the probability output by the output layer. For example, suppose a piece of comment sample data contains several illegal words, these words are input into the comment identification model for training, and the output layer reports a probability of only 0.2 that the comment is illegal. In this case, the parameters of the convolutional layer and the regression layer need to be adjusted to improve the identification accuracy of the comment identification model.

The comment identification model is first trained on a large amount of comment sample data and only then put into actual use, which greatly improves the accuracy with which it judges the type of a user comment.

In one embodiment, adjusting the parameters of the convolution kernels of the convolutional layer comprises: adjusting the length and the number of the convolution kernels, where the length of a convolution kernel is less than or equal to 10 and the number of convolution kernels is less than or equal to 200.

During training of the comment identification model, the parameters of the convolution kernels in the convolutional layer can be adjusted according to the probability output by the output layer. In the conventional technique, convolution kernels are generally set to lengths 3, 4 and 5, with 100 kernels of each length; however, such parameters lead to a lower recognition accuracy of the comment identification model. In this embodiment, therefore, the parameters of the convolution kernels are adjusted to improve the recognition accuracy. The length of a convolution kernel is set to be less than or equal to 10, and the number of convolution kernels is set to be less than or equal to 200. Specifically, the kernel lengths can be set to 1, 2, 3, 4, 5, 6 and 8, with 50, 100, 150, 150, 200, 150 and 100 kernels respectively. That is, there are 50 kernels of length 1, 100 kernels of length 2, 150 kernels of length 3, and so on.
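
The kernel settings given above can be written down and checked against the stated limits as in the sketch below. It assumes that the limit of 200 applies to the number of kernels of each length (the example uses 200 kernels of length 5), and the names `KERNEL_COUNTS` and `validate_kernel_config` are hypothetical.

```python
# Kernel length -> number of kernels of that length, as listed in the text.
KERNEL_COUNTS = {1: 50, 2: 100, 3: 150, 4: 150, 5: 200, 6: 150, 8: 100}

MAX_KERNEL_LENGTH = 10        # "length of convolution kernel <= 10"
MAX_KERNELS_PER_LENGTH = 200  # assumption: the limit of 200 is per length

def validate_kernel_config(kernel_counts):
    """Check the limits and return the total number of kernels."""
    for length, count in kernel_counts.items():
        if not 1 <= length <= MAX_KERNEL_LENGTH:
            raise ValueError(f"kernel length {length} is out of range")
        if not 1 <= count <= MAX_KERNELS_PER_LENGTH:
            raise ValueError(f"kernel count {count} is out of range")
    return sum(kernel_counts.values())
```

If each kernel yields one feature map and max pooling keeps one value per feature map, this configuration would pass 900 values to the regression layer.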

In one embodiment, as shown in Fig. 6, the method further comprises the following steps before step 304:

Step 602: convert the English in comment sample data that contains English to lowercase.

Step 604: remove the special characters from the comment sample data.

Step 606: truncate each comment sample according to a preset truncation length.

When the comment identification model is trained, word segmentation needs to be performed on the comment sample data. Before this step, the comment sample data can also be preprocessed. The preprocessing comprises three steps. In the first step, the English in comment sample data that contains English is converted to lowercase. For example, if a piece of comment sample data is "I am very Happy today~!", the English it contains can be converted to lowercase, so that the comment sample data becomes "I am very happy today~!". In the second step, the special characters in the comment sample data are removed. Special characters may include emoticons, punctuation marks, kaomoji and the like; in general, technical staff can treat every character other than Chinese characters, English letters and digits as a special character. After the special characters are removed, the comment sample data "I am very happy today~!" becomes "I am very happy today". In the third step, each comment sample is truncated according to a preset truncation length. The preset truncation length can be customized and is generally set to 200-300 bytes, where each Chinese character, English letter or digit is counted as one byte; for example, the comment "I am very happy today" counts as 10 bytes in its original text. Truncation targets longer comments: when a piece of comment data is a long passage of text, it needs to be truncated, splitting the comment into multiple sentences.

Because illegal words in a comment generally appear at its end, truncation can proceed from the end of the comment toward the front. Preprocessing the comment sample data before word segmentation, and then inputting the resulting words into the comment identification model for training, reduces the influence of irrelevant information such as special characters on the recognition result and makes the training data cleaner, which improves the accuracy with which the trained comment identification model identifies comment types.

In one embodiment, the comment identification model is a text classification neural network.

A text classification neural network (TextCNN) is a neural network that classifies text using a convolutional neural network. Compared with a character-level convolutional neural network (Char-CNN) and a long short-term memory network (LSTM), the text classification neural network has the highest recognition accuracy for comments. When performing convolution, the convolutional layer of the text classification neural network behaves similarly to the statistical language model (n-gram) in natural language processing, in that it takes into account the order of the input words in the sentence. Feature extraction therefore effectively takes into account the position of each word vector in the original comment. Because text is rich in meaning, judging whether a word is an illegal word, and hence whether the comment containing it is an illegal comment, requires taking the context into account.

For example, one comment reads "You cannot just numb yourself every day without doing anything practical"; in this comment, the word "numb" is not an illegal word. Another comment reads "Numb! Are you a fool?"; in this comment, the same word is an illegal word. Therefore, when the comment identification model judges from the input words whether the corresponding comment is an illegal comment, it needs to take into account the position of each word in the comment. The text classification neural network does take this into account: when it extracts features from the word vectors, it does so successively in the order of the word vectors in the comment, and never selects words across gaps. This ensures that the context of each word is considered when the comment type is judged.
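
The order-preserving feature extraction described above can be illustrated with a one-dimensional convolution over consecutive word vectors, followed by max pooling. This is a sketch in plain Python rather than the model's actual implementation; the name `conv1d_max_pool` is hypothetical, and bias terms and activation functions are omitted.

```python
def conv1d_max_pool(word_vectors, kernel):
    """Slide a kernel over consecutive word vectors, then take the maximum.

    word_vectors: list of equal-length vectors, in the order of the comment.
    kernel: list of k weight vectors; k is the kernel length.
    """
    k = len(kernel)
    if k == 0 or len(word_vectors) < k:
        raise ValueError("need at least as many word vectors as kernel rows")
    feature_map = []
    for start in range(len(word_vectors) - k + 1):
        # Each window covers k adjacent words; no word is skipped.
        window = word_vectors[start:start + k]
        value = sum(
            w * x
            for kernel_row, vector in zip(kernel, window)
            for w, x in zip(kernel_row, vector)
        )
        feature_map.append(value)
    return feature_map, max(feature_map)
```

Because every window consists of adjacent words only, a kernel of length k responds to k-word sequences in the same way that an n-gram model looks at n consecutive words.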

As shown in Fig. 7, in one embodiment, a comment identification method is provided. This embodiment is mainly illustrated by applying the method to the comment identification server 120 in Fig. 1 above. Referring to Fig. 7, the comment identification method specifically comprises the following steps:

Step 702: obtain comment sample data and preprocess the comment sample data.

Step 704: perform word segmentation on the comment sample data to obtain multiple words.

Step 706: sequentially input the multiple words contained in each piece of comment sample data into the comment identification model, and train the comment identification model.

Before the comment identification model is put into practical use, it can be trained in advance. A large amount of comment sample data can be obtained from various applications or web pages, for example the user comments below news articles in Tencent News and Tiantian Kuaibao. After the comment sample data is obtained, it can be labeled, dividing it into illegal comments and legal comments. The comment sample data is also divided into training data, verification data and test data. The training data is used to train the comment identification model, the verification data is used to verify the accuracy of the trained model, and the test data is used to test the verified model. Only after the comment identification model has passed verification on the verification data and the test on the test data is its training considered complete.

Preprocessing the comment sample data includes converting the English in comment sample data that contains English to lowercase; removing the special characters from the comment sample data; and truncating each comment sample according to a preset truncation length. Word segmentation can then be performed on the preprocessed comment sample data, in one of three ways. The first is character granularity: the comment sample data is split character by character, each piece of comment sample data is split into multiple characters, and the characters are used as the input of the comment identification model. The second is word granularity: a word segmentation tool, such as jieba, is used to split each piece of comment sample data into multiple words, and the words are used as the input of the comment identification model. The third is pinyin granularity: after the comment sample data has been split into characters or words, the pinyin of each character or word is used as the input of the comment identification model. Because illegal comments often involve common stop words such as "you", "he" and "I", stop words may be left in place during character or word splitting, which reduces the loss of information.
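
The three segmentation modes above can be sketched with a single dispatching function. The function `segment` and its parameters are hypothetical; to keep the sketch self-contained, the word tokenizer and the pinyin converter are passed in as callables instead of being imported. In practice a caller might pass `jieba.lcut` as `word_tokenizer`, and a pinyin library could supply `to_pinyin`.

```python
def segment(comment, granularity="char", word_tokenizer=None, to_pinyin=None):
    """Split a preprocessed comment into model inputs.

    granularity: "char" (character by character), "word" (via a tokenizer),
    or "pinyin" (pinyin of each character or word). Stop words are kept.
    """
    if granularity == "char":
        return list(comment)
    if granularity == "word":
        if word_tokenizer is None:
            raise ValueError("word granularity needs a word_tokenizer")
        return list(word_tokenizer(comment))
    if granularity == "pinyin":
        if to_pinyin is None:
            raise ValueError("pinyin granularity needs a to_pinyin callable")
        # Split into words if a tokenizer is given, otherwise into characters.
        units = list(word_tokenizer(comment)) if word_tokenizer else list(comment)
        return [to_pinyin(unit) for unit in units]
    raise ValueError(f"unknown granularity: {granularity}")
```

For example, `segment("你好吗")` returns `["你", "好", "吗"]`.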

In the table shown in Fig. 8, precision is the ratio of the number of comments that are truly illegal among the comments identified as illegal by the comment identification model to the number of comments identified as illegal. Recall is the ratio of the number of comments that are truly illegal among the comments identified as illegal by the comment identification model to the number of illegal comments contained in the comment sample data. The F value weighs precision and recall to measure the overall recognition performance, that is, the F value is calculated from precision and recall. The formula for the F value can be: F = 2 / (1/precision + 1/recall).

For example, suppose that 100 comments in the whole comment sample data are illegal, that the comment identification model identifies 120 comments as illegal, and that 90 of the comments identified as illegal are actually illegal. Then precision = 90/120 and recall = 90/100, so the F value is 0.82. In this embodiment, three ways of processing the comment sample data were tried; the precision, recall and F value for each are shown in Fig. 8. With splitting at character granularity, training on the comment sample data gives a precision of 85.32%, a recall of 86.18% and an F value of 0.86. With splitting at word granularity, the precision is 86.09%, the recall is 73.01% and the F value is 0.79. With splitting at pinyin granularity, the precision is 85.28%, the recall is 68.89% and the F value is 0.75. It follows that the F value is highest when the comment sample data is split character by character and the resulting characters are input into the comment identification model for training. Technical staff can therefore give priority to processing the comment sample data at character granularity. After the comment sample data has been processed, the resulting characters, words or pinyin can be input into the comment identification model to train it. The comment identification model can be a text classification neural network.
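
The worked example above can be reproduced with a few lines of code. The function name `precision_recall_f` and its parameter names are hypothetical; the formula is the one given in the text.

```python
def precision_recall_f(true_illegal_hits, flagged_total, illegal_total):
    """Compute precision, recall and F = 2 / (1/precision + 1/recall).

    true_illegal_hits: comments flagged as illegal that really are illegal.
    flagged_total: all comments the model flagged as illegal.
    illegal_total: all illegal comments in the sample data.
    """
    if flagged_total <= 0 or illegal_total <= 0:
        raise ValueError("totals must be positive")
    precision = true_illegal_hits / flagged_total
    recall = true_illegal_hits / illegal_total
    if true_illegal_hits == 0:
        return precision, recall, 0.0  # avoid dividing by zero
    f_value = 2 / (1 / precision + 1 / recall)
    return precision, recall, f_value
```

With the numbers from the example, `precision_recall_f(90, 120, 100)` gives a precision of 0.75, a recall of 0.9 and an F value that rounds to 0.82.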

Step 708: establish an illegal-comment dictionary and a suspected-illegal-comment dictionary.

Step 710: obtain a comment to be identified.

Step 712: when the comment to be identified contains no word from the illegal-comment dictionary but contains a word from the suspected-illegal-comment dictionary, input the comment to be identified into the comment identification model.

Step 714: identify the comment to be identified by means of the comment identification model, and output the comment type of the comment to be identified.

Before comment identification is carried out, the illegal-comment dictionary and the suspected-illegal-comment dictionary can be established in advance. The illegal-comment dictionary contains a large number of words labeled as illegal. Similarly, the suspected-illegal-comment dictionary contains a large number of words labeled as suspected illegal, that is, words that may be legal in some contexts and illegal in certain special contexts. If a word from the illegal-comment dictionary appears in the comment to be identified, the comment is an illegal comment. If no such word appears, a further check is needed to determine whether the comment contains a word from the suspected-illegal-comment dictionary. If a word from the suspected-illegal-comment dictionary appears, the comment to be identified is a suspected illegal comment, that is, it cannot yet be determined whether it is illegal, and the comment can be input into the comment identification model for judgment.

As shown in Fig. 9, the comment to be identified is obtained first, and it is judged whether the comment contains a word from the illegal-comment dictionary. If so, the comment to be identified is an illegal comment. If not, it is judged whether the comment contains a word from the suspected-illegal-comment dictionary. If so, the comment to be identified is a suspected illegal comment and needs to be input into the comment identification model for checking; if not, the comment to be identified is a legal comment. After a suspected illegal comment is input into the comment identification model, the model outputs a detection result indicating whether the comment is a legal comment or an illegal comment.
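
The decision flow of Fig. 9 can be sketched as below. The function `identify_comment` is hypothetical, the two dictionaries are represented as plain sets of words, and `model` stands in for the trained comment identification model as any callable that returns "legal" or "illegal".

```python
def identify_comment(words, illegal_words, suspected_words, model):
    """Classify a segmented comment using the two dictionaries and the model."""
    # Any word from the illegal-comment dictionary settles the result.
    if any(word in illegal_words for word in words):
        return "illegal"
    # Only suspected illegal comments are passed on to the model.
    if any(word in suspected_words for word in words):
        return model(words)
    # No hit in either dictionary: the comment is treated as legal.
    return "legal"
```

In this sketch the model is called only for comments that hit the suspected-illegal-comment dictionary without hitting the illegal-comment dictionary, so the number of model calls depends on how large that dictionary is.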

In this embodiment, the illegal-comment dictionary and the suspected-illegal-comment dictionary provide a two-stage screening of the comment to be identified. The illegal-comment dictionary makes it possible to screen out comments to be identified that are determined to be legal comments, avoiding accidental blocking, and the suspected-illegal-comment dictionary picks out comments to be identified that may carry different meanings in different contexts. Comments that pass through both dictionaries and are determined to be suspected illegal comments rather than legal comments are input into the comment identification model for detection, and the model then accurately identifies the type of these comments. This comment identification method, which combines strategy with a model, can not only accurately filter out legal comments and avoid misjudging them, but can also accurately identify the type of suspected illegal comments, which greatly improves the accuracy of comment identification.
In addition, technical staff can adjust the parameters of the comment identification model according to actual needs, for example the length and number of the convolution kernels, or the parameters of the regression layer. This adjusts the model's feature extraction ability, its ability to identify input comments and its recognition accuracy, so that the model keeps evolving and its ability to identify the type of comments to be identified grows stronger.

In one embodiment, the comment identification model can be packaged as an interface and applied in various comment review systems, such as the comment review system of a news client. It can be used to identify user comments on news or web pages and to sort the comments. For example, user comments judged to be legal comments are placed at the front of the comment list, while user comments judged to be illegal comments are blocked or placed at the end of the comment list. When a comment review system calls the interface into which the comment identification model is packaged, user comments can first be screened with the established illegal-comment dictionary and suspected-illegal-comment dictionary. Comments that contain no word from the illegal-comment dictionary, that is, legal comments, are filtered out, and comments that contain a word from the suspected-illegal-comment dictionary are checked by calling the interface; user comments that the interface detects as illegal comments are intercepted. This relieves the pressure on manual review. At the same time, the results of manual review can be collected to update the comment identification model periodically, which further improves its recognition accuracy and forms a virtuous cycle.

Figs. 2 to 7 are flow diagrams of the comment identification method in the various embodiments. It should be understood that although the steps in each flowchart are shown in sequence as indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in each figure may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same moment but may be executed at different times, and they are not necessarily performed one after another but may be performed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.

In one embodiment, as shown in Fig. 10, a comment identification device is provided, comprising:

A comment obtaining module 1002, configured to obtain a comment to be identified.

A comment judgment module 1004, configured to input the comment to be identified into the comment identification model when the comment contains no word from the illegal-comment dictionary but contains a word from the suspected-illegal-comment dictionary.

A comment identification module 1006, configured to output the comment type of the comment to be identified, the comment type including legal comment and illegal comment.

In one embodiment, as shown in Fig. 11, the above device further includes a training module 1008, configured to obtain comment sample data; perform word segmentation on the comment sample data to obtain multiple words; sequentially input the words contained in each piece of comment sample data into the comment identification model; and train the comment identification model with the input words to obtain a trained comment identification model.

In one embodiment, the training module 1008 is further configured to perform word segmentation on the comment sample data in any one of the following ways: splitting each piece of comment sample data character by character; or performing word segmentation on the comment sample data using a word segmentation tool.

In one embodiment, the training module 1008 is further configured to split each piece of comment sample data character by character, and to perform word segmentation on the comment sample data using a word segmentation tool.

In one embodiment, the training module 1008 is further configured to obtain the pinyin of each word and to sequentially input the pinyin corresponding to each word contained in each piece of comment sample data into the comment identification model.

In one embodiment, as shown in Fig. 12, the training module 1008 comprises:

A word vector conversion module 1008A, configured to convert the input words into word vectors of equal length by means of the word embedding layer of the comment identification model.

A convolution module 1008B, configured to input the word vectors of equal length into the convolutional layer, adjust the parameters of the convolution kernels of the convolutional layer, and perform convolution on the word vectors of equal length using the adjusted parameters of the convolution kernels.

A pooling module 1008C, configured to input the feature maps obtained by the convolution into the pooling layer for max pooling, and to input the maximum value of each feature map into the regression layer.

An output module 1008D, configured to use the two probabilities output by the regression layer as the input data of the output layer; the output layer takes the larger probability as its output, and the type identification result of the comment corresponding to the probability output by the output layer is determined according to a probability threshold.

In one embodiment, the convolution module 1008B is further configured to adjust the length and the number of the convolution kernels, where the length of a convolution kernel is less than or equal to 10 and the number of convolution kernels is less than or equal to 200.

In one embodiment, the training module 1008 further includes a preprocessing module (not shown), configured to convert the English in comment sample data that contains English to lowercase; remove the special characters from the comment sample data; and truncate each comment sample according to a preset truncation length.

In one embodiment, the comment identification model is a text classification neural network.

Fig. 13 shows the internal structure of the computer device in one embodiment. The computer device may specifically be the comment identification server 120 in Fig. 1. As shown in Fig. 13, the computer device includes a processor, a memory and a network interface connected by a system bus. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium of the computer device stores an operating system and may also store a computer program which, when executed by the processor, causes the processor to implement the comment identification method. The internal memory may also store a computer program which, when executed by the processor, causes the processor to execute the comment identification method.

Those skilled in the art will understand that the structure shown in Fig. 13 is only a block diagram of the part of the structure relevant to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.

In one embodiment, the comment identification device provided by this application can be implemented in the form of a computer program, and the computer program can run on the computer device shown in Fig. 13. The memory of the computer device can store the program modules that make up the comment identification device, for example the comment obtaining module, the comment judgment module and the comment identification module shown in Fig. 10. The computer program formed by these program modules causes the processor to execute the steps of the comment identification method of the embodiments of this application described in this specification.

For example, the computer device shown in Fig. 13 can obtain the comment to be identified through the comment obtaining module of the comment identification device shown in Fig. 10. Through the comment judgment module, the computer device can input the comment to be identified into the comment identification model when the comment contains no word from the illegal-comment dictionary but contains a word from the suspected-illegal-comment dictionary. Through the comment identification module, the computer device can output the comment type of the comment to be identified, the comment type including legal comment and illegal comment.

In one embodiment, a computer device is provided, including a memory and a processor. The memory stores a computer program, and the processor, when executing the computer program, performs the following steps: obtaining a comment to be identified; when the comment to be identified contains no word from the illegal-comment dictionary but contains a word from the suspected-illegal-comment dictionary, inputting the comment to be identified into the comment identification model; and outputting the comment type of the comment to be identified, the comment type including legal comment and illegal comment.

In one embodiment, the computer program further causes the processor to execute the steps of generating the comment identification model, including: obtaining comment sample data; performing word segmentation on the comment sample data to obtain multiple words; sequentially inputting the multiple words contained in each piece of comment sample data into the comment identification model; and training the comment identification model with the input words to obtain a trained comment identification model.

In one embodiment, sequentially inputting the multiple words contained in each piece of comment sample data into the comment identification model comprises: obtaining the pinyin of each word, and sequentially inputting the pinyin corresponding to each word contained in each piece of comment sample data into the comment identification model.

In one embodiment, training the comment identification model with the input words comprises: converting the input words into word vectors of equal length by means of the word embedding layer of the comment identification model; inputting the word vectors of equal length into the convolutional layer, adjusting the parameters of the convolution kernels of the convolutional layer, and performing convolution on the word vectors of equal length using the adjusted parameters of the convolution kernels; inputting the feature maps obtained by the convolution into the pooling layer for pooling, and inputting the maximum value of each feature map into the regression layer; using the two probabilities output by the regression layer as the input data of the output layer; and taking, by the output layer, the larger probability as the output, and determining the type identification result of the comment corresponding to the probability output by the output layer according to a probability threshold.

In one embodiment, before word segmentation is performed on the comment sample data, the computer program further causes the processor to execute the following steps: converting the English in comment sample data that contains English to lowercase; removing the special characters from the comment sample data; and truncating each comment sample according to a preset truncation length.

In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored. When executed by a processor, the computer program performs the following steps: obtaining a comment to be identified; when the comment to be identified contains no word from the illegal-comment dictionary but contains a word from the suspected-illegal-comment dictionary, inputting the comment to be identified into the comment identification model; and outputting the comment type of the comment to be identified, the comment type including legal comment and illegal comment.

In one embodiment, when executed by the processor, the computer program further performs the following steps of generating the comment identification model: obtaining comment sample data; performing word segmentation on the comment sample data to obtain multiple words; sequentially inputting the multiple words contained in each piece of comment sample data into the comment identification model; and training the comment identification model with the input words to obtain a trained comment identification model.

In one embodiment, before word segmentation is performed on the comment sample data, the computer program further causes the processor to execute the following steps: converting the English in comment sample data that contains English to lowercase; removing the special characters from the comment sample data; and truncating each comment sample according to a preset truncation length.

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program can be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. Any reference to memory, storage, a database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM) or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous link (Synchlink) DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM) and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features involves no contradiction, it should be considered to fall within the scope of this specification.

The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of this application, and these all fall within the scope of protection of this application. The scope of protection of this patent shall therefore be subject to the appended claims.

Claims

1. A comment identification method, comprising:

obtaining a comment to be identified;

when the comment to be identified contains no word from an illegal-comment dictionary but contains a word from a suspected-illegal-comment dictionary, inputting the comment to be identified into a comment identification model; and

outputting the comment type of the comment to be identified, the comment type comprising legal comment and illegal comment.

2. The method according to claim 1, wherein the illegal-comment dictionary contains illegal words whose hit accuracy exceeds a threshold, and the suspected-illegal-comment dictionary contains illegal words whose hit accuracy does not exceed the threshold;

the hit accuracy being the ratio of the number of illegal comments in comment data that contain the illegal word to the number of comments that contain the illegal word.

3. The method according to claim 1, wherein the comment identification model is generated by:

obtaining comment sample data;

performing word segmentation on the comment sample data to obtain multiple words;

sequentially inputting the multiple words contained in each piece of comment sample data into the comment identification model; and

training the comment identification model with the input words to obtain a trained comment identification model.

4. The method according to claim 3, wherein the word segmentation of the comment sample data is performed in any one of the following ways:

splitting each piece of comment sample data character by character;

performing word segmentation on the comment sample data using a word segmentation tool;

and wherein sequentially inputting the plurality of words contained in each piece of comment sample data into the comment identification model comprises:

obtaining the pinyin of each word, and sequentially inputting the pinyin corresponding to each word contained in each piece of comment sample data into the comment identification model.
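
A minimal sketch of the character-level split and the pinyin conversion in claim 4, under stated assumptions: `TOY_PINYIN` is a hypothetical four-entry lookup table used only so the example is self-contained (a real system would rely on a full pinyin lexicon or library), whitespace is dropped during splitting, and tokens missing from the table pass through unchanged.

```python
from typing import Dict, List

# Hypothetical, tiny pinyin table for illustration only.
TOY_PINYIN: Dict[str, str] = {"加": "jia", "微": "wei", "薇": "wei", "信": "xin"}


def split_by_character(comment: str) -> List[str]:
    """Split a comment into single characters, skipping whitespace."""
    return [ch for ch in comment if not ch.isspace()]


def to_pinyin(tokens: List[str], table: Dict[str, str]) -> List[str]:
    """Map each token to its pinyin; unknown tokens are kept as they are."""
    return [table.get(token, token) for token in tokens]


for text in ("加微信", "加薇信"):
    print(text, to_pinyin(split_by_character(text), TOY_PINYIN))
```

In this toy table the two spellings differ in one character but yield the same pinyin sequence, so a model fed pinyin would see identical input for both.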

5. The method according to claim 3, wherein training the comment identification model with the input words comprises:

converting the input words into word vectors of equal length by a word embedding layer of the comment identification model;

inputting the equal-length word vectors into a convolutional layer, adjusting the parameters of the convolution kernels of the convolutional layer, and performing a convolution operation on the equal-length word vectors using the adjusted convolution kernel parameters;

inputting the feature maps obtained by the convolution operation into a pooling layer for pooling, and inputting the maximum value of each feature map into a regression layer;

outputting two probabilities through the regression layer as the input data of an output layer;

outputting, by the output layer, the probability with the largest value, and determining, according to a probability threshold, the type identification result of the comment corresponding to the probability output by the output layer.
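
The layer sequence in claim 5 (embedding, convolution, max pooling, a regression layer producing two probabilities, thresholded output) can be illustrated by the forward pass below. It is a sketch in plain Python with random, untrained weights; it does not perform training. All names, sizes, and the seed are hypothetical. Treating the regression layer as a linear map followed by softmax, taking index 1 as the "illegal" class, and applying the threshold to the winning probability are assumptions about details the claim leaves open.

```python
import math
import random
from typing import List, Sequence

Vector = List[float]


def conv1d(seq: Sequence[Vector], kernel: Sequence[Vector], bias: float) -> List[float]:
    """Slide one kernel over the token axis and return its feature map."""
    width = len(kernel)
    feature_map = []
    for start in range(len(seq) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(x * w for x, w in zip(seq[start + offset], kernel[offset]))
        feature_map.append(total)
    return feature_map


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    norm = sum(exps)
    return [e / norm for e in exps]


def forward(token_ids, embedding, kernels, biases, out_weights, out_biases) -> List[float]:
    # Embedding layer: every token id becomes a vector of the same length.
    seq = [embedding[i] for i in token_ids]
    # Convolution, then max pooling: keep the maximum of each feature map.
    pooled = [max(conv1d(seq, k, b)) for k, b in zip(kernels, biases)]
    # Regression layer (assumed linear + softmax) yields two probabilities.
    logits = [sum(p * w for p, w in zip(pooled, row)) + b
              for row, b in zip(out_weights, out_biases)]
    return softmax(logits)


def decide(probs: Sequence[float], threshold: float = 0.5) -> str:
    # Assumption: index 1 is "illegal"; the larger probability must reach the threshold.
    best = 0 if probs[0] >= probs[1] else 1
    return "illegal" if best == 1 and probs[best] >= threshold else "legal"


rng = random.Random(0)


def rand_vec(n: int) -> Vector:
    return [rng.uniform(-1, 1) for _ in range(n)]


VOCAB, DIM = 20, 4
embedding = [rand_vec(DIM) for _ in range(VOCAB)]
kernels = [[rand_vec(DIM) for _ in range(width)] for width in (2, 3, 3)]
biases = [0.0] * len(kernels)
out_weights = [rand_vec(len(kernels)) for _ in range(2)]
out_biases = [0.0, 0.0]

probs = forward([3, 7, 1, 12, 5, 9], embedding, kernels, biases, out_weights, out_biases)
print(probs, decide(probs))
```

The input sequence must be at least as long as the widest kernel; otherwise a feature map is empty and the pooling step has nothing to take a maximum over.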

6. The method according to claim 5, wherein adjusting the parameters of the convolution kernels of the convolutional layer comprises:

adjusting the length and the number of the convolution kernels, the length of the convolution kernels being less than or equal to 10 and the number of the convolution kernels being less than or equal to 200.

7. The method according to any one of claims 3 to 6, further comprising, before performing word segmentation on the comment sample data:

converting the English text in any comment sample data that contains English to lowercase;

removing special characters from the comment sample data;

truncating each comment sample according to a preset truncation length.
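
The three preprocessing steps in claim 7 can be sketched as below. The function name `preprocess` and the default length of 50 are hypothetical, and the definition of "special character" (anything that is neither a Unicode word character nor whitespace) is an assumption, since the claim does not define the term.

```python
import re


def preprocess(comment: str, max_len: int = 50) -> str:
    """Lowercase English letters, strip special characters, then truncate."""
    text = comment.lower()
    # Assumption: keep Unicode word characters (including CJK) and whitespace only.
    text = re.sub(r"[^\w\s]", "", text)
    return text[:max_len]


print(preprocess("Hello!!! 加V信★ NOW", max_len=9))  # prints "hello 加v信"
```

Truncating after cleaning, rather than before, means the length budget is spent only on characters that survive the filter.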

8. A comment identification device, wherein the device comprises:

a comment obtaining module, configured to obtain a comment to be identified;

a comment judgment module, configured to input the comment to be identified into a comment identification model when the comment to be identified contains no word from an illegal comment dictionary but contains a word from a suspected illegal comment dictionary;

a comment identification module, configured to output the comment type of the comment to be identified, the comment type including legal comment and illegal comment.

9. A computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.

10. A computer device, comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the method according to any one of claims 1 to 7.