CN110020147A - Model generation and comment recognition methods, systems, devices and storage media - Google Patents
- Publication number
- CN110020147A CN110020147A CN201711225988.1A CN201711225988A CN110020147A CN 110020147 A CN110020147 A CN 110020147A CN 201711225988 A CN201711225988 A CN 201711225988A CN 110020147 A CN110020147 A CN 110020147A
- Authority
- CN
- China
- Prior art keywords
- data
- comment
- module
- model
- historical review
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
Abstract
The invention discloses a model generation method, a comment recognition method, and corresponding systems, devices and storage media. The model generation method comprises the following steps: S1, obtaining historical comment data; S2, labeling each item of historical comment data to generate first intermediate data, where each item of first intermediate data comprises the historical comment data and a corresponding label, the label being spam comment or valuable comment; S3, converting each item of first intermediate data into a historical comment sequence; S4, obtaining features, and inputting the historical comment sequences and the features into a recurrent neural network for model training, so as to generate a classification model for spam comments. The invention applies recurrent neural networks to spam-comment recognition: a classification model is trained from historical comment data, and new comments to be recognized are fed to the model to determine whether they are spam, which reduces recognition cost and improves the coverage and accuracy of spam-comment recognition.
Description
Technical field
The invention belongs to the field of spam-comment recognition, and in particular relates to a model generation method, a comment recognition method, and corresponding systems, devices and storage media for spam comments based on a recurrent neural network.
Background technique
With the development of the internet and artificial intelligence, the quantity and influence of online comments keep growing. Comments affect people in many fields and are especially important on the internet: effectively mining user information can further improve products, and for users, reading comments reveals the feedback of buyers who already own an item, helping them understand its strengths, weaknesses and value for money in time, and ultimately make a purchase decision. However, comments are often flooded with noise: some do not evaluate the item at all but post unrelated text such as poems, while others are advertisements, links, or even abusive language. Such comments are called spam comments, and recognizing them is a challenging and meaningful task.
In the prior art, spam comments are usually recognized with the analytic hierarchy process (AHP): an analyst determines feature weights from business experience and then writes a formula that scores each comment. This approach has three drawbacks. First, the labor cost is high, since experts with both business experience and product-analysis skills must take part in many evaluation steps, which runs against the machine-learning trend of the current era of artificial intelligence. Second, when the data volume is large, the calculation usually becomes more complex in order to guarantee the accuracy of the feature vector. Third, AHP is a statistical technique that estimates weights on small samples; it is largely qualitative rather than quantitative, so its results are not accurate enough.
Summary of the invention
The technical problem to be solved by the invention is to overcome the high labor cost, complex calculation and limited accuracy of prior-art spam-comment recognition by providing a recurrent-neural-network-based model generation method, comment recognition method, system, device and storage medium that improve the accuracy of spam-comment recognition.
The invention solves the above technical problem through the following technical solutions:
The invention provides a model generation method, characterized by comprising the following steps:
S1, obtaining historical comment data;
S2, labeling each item of historical comment data to generate first intermediate data, where each item of first intermediate data comprises the historical comment data and a corresponding label, the label being spam comment or valuable comment;
S3, converting each item of first intermediate data into a historical comment sequence;
S4, obtaining features, and inputting the historical comment sequences and the features into a recurrent neural network for model training, so as to generate a classification model for spam comments.
In this solution, the historical comment data is comment text; only after it is converted into comment sequences can the recurrent neural network train on it and recognize it.
In this solution, the historical comment data is first labeled to mark each item as a spam comment or a valuable comment; the features and the historical comment sequences corresponding to the labeled data are then fed into the recurrent neural network for training, finally producing a classification model for spam comments. This classification model can later be used to decide whether a new comment is spam.
In this solution, recurrent neural networks are applied to spam-comment recognition: a classification model suited to spam comments is trained from historical comment data, providing a decision on whether each subsequent new comment is spam. Because the classification model recognizes new comments automatically, manual participation is no longer needed, which reduces the labor cost of spam recognition.
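Steps S1–S3 above can be sketched as a small data pipeline. This is a minimal illustration, not the patent's implementation: the helper names (`annotate`, `to_sequence`) and the toy vocabulary are assumptions, and step S4's recurrent network is only indicated by a comment.

```python
# A minimal sketch of steps S1-S3: obtain comments, attach labels, and turn
# each labeled comment into a token-id sequence ready for RNN training.
SPAM, VALUABLE = "spam", "valuable"

def annotate(historical_comments, labels):
    """S2: pair each historical comment with its manual label."""
    assert len(historical_comments) == len(labels)
    return [{"comment": c, "label": l}
            for c, l in zip(historical_comments, labels)]

def to_sequence(item, vocab):
    """S3: convert a labeled comment into a token-id sequence."""
    tokens = item["comment"].split()
    return [vocab.setdefault(t, len(vocab)) for t in tokens], item["label"]

# S1: obtain historical comment data (toy stand-in).
comments = ["great phone fast delivery", "click this link now"]
labels = [VALUABLE, SPAM]

vocab = {}
intermediate = annotate(comments, labels)            # first intermediate data
sequences = [to_sequence(i, vocab) for i in intermediate]
# S4 would feed `sequences` plus engineered features into the RNN for training.
```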
Preferably, the recurrent neural network is an LSTM (Long Short-Term Memory) network.
Preferably, the model training in step S4 comprises a step of tuning core parameters, the core parameters comprising batch_size, num_steps, vocab_size, hidden_units and learning_rate.
In this solution, the recurrent neural network is an LSTM. Among the many parameters of an LSTM, the core parameters that affect the accuracy of spam-comment recognition are batch_size, num_steps, vocab_size, hidden_units and learning_rate. Here batch_size is the number of samples per batch of gradient-descent iteration, i.e. each training step takes batch_size samples from the training set, usually a power of 2; num_steps is the number of deep-learning steps, a positive integer; vocab_size is the size of the recurrent network's word sliding window; hidden_units is the number of deep-learning hidden-layer units; and learning_rate is the learning rate of the deep neural network.
In this solution, training data is selected and forward-propagated through the recurrent neural network to obtain predictions, and the parameters are updated by backpropagation; in this way the core parameters affecting spam-recognition accuracy, and the value of each, are picked out.
Preferably, batch_size is 64, num_steps is 100, vocab_size is 2, hidden_units is 8, and learning_rate is 0.001.
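To make the sizing concrete, here is a single LSTM-cell forward step in NumPy using the preferred batch_size and hidden_units. This is illustrative only: real training would use TensorFlow, and treating vocab_size as the per-step input dimension is an interpretative assumption.

```python
import numpy as np

batch_size, hidden_units, vocab_size = 64, 8, 2   # preferred values above
rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM step; gates stacked as [input, forget, cell, output]."""
    z = np.concatenate([x, h], axis=1) @ W + b        # (batch, 4*hidden)
    i, f, g, o = np.split(z, 4, axis=1)
    c_new = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # new cell state
    h_new = sigmoid(o) * np.tanh(c_new)               # new hidden state
    return h_new, c_new

W = rng.standard_normal((vocab_size + hidden_units, 4 * hidden_units)) * 0.1
b = np.zeros(4 * hidden_units)
x = rng.standard_normal((batch_size, vocab_size))   # one batch of inputs
h = np.zeros((batch_size, hidden_units))
c = np.zeros((batch_size, hidden_units))

h, c = lstm_step(x, h, c, W, b)
```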
Preferably, in step S4 the core parameters are tuned using TensorFlow (the second-generation artificial-intelligence learning system).
In this solution, distributed TensorFlow is used: multiple devices read the parameter values simultaneously, and once the backpropagation algorithm completes, the parameter values are updated synchronously. No single device updates the parameters alone; the system waits until all devices finish backpropagation and then updates the parameters in one step. In each iteration, each device receives a random slice of the data and computes gradients for its own training parameters; after all devices complete the backpropagation computation, the gradients from the different devices are averaged and the parameters are finally updated.
In this solution, training the model in parallel on multiple GPUs (Graphics Processing Units) through distributed TensorFlow handles large volumes of historical comment sequences with faster processing, which improves the user experience.
Preferably, the model generation method further comprises extracting features to generate the features.
Preferably, the features comprise comment features of the item and user features;
the comment features of the item comprise at least one of the following:
the commenter's comment ranking score, the distance of the comment creation time from the current time, the comment score, the number of likes on the comment, the number of replies to the comment, the comment length, the number of pictures in the comment, whether the comment has a follow-up comment, and the number of item tags included in the comment;
the user features comprise at least one of the following:
user gender, user purchasing-power level, user membership-level information, and user value score.
In this solution, features can be extracted from the historical comment data by feature engineering for use in training the recurrent neural network.
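The feature-engineering step can be sketched as assembling the listed comment and user features into one numeric vector. The field names and encodings below are illustrative assumptions; the patent only enumerates the features.

```python
def extract_features(comment, user):
    """Assemble the patent's listed features into a single numeric vector."""
    return [
        comment["ranking_score"],       # commenter's comment ranking score
        comment["age_days"],            # creation time's distance from now
        comment["score"],               # comment score
        comment["likes"],               # number of likes on the comment
        comment["replies"],             # number of replies to the comment
        comment["length"],              # comment length
        comment["pictures"],            # number of pictures in the comment
        int(comment["has_followup"]),   # whether there is a follow-up comment
        comment["tag_count"],           # number of item tags in the comment
        user["gender"],                 # user gender (encoded)
        user["purchasing_power"],       # user purchasing-power level
        user["member_level"],           # user membership level
        user["value_score"],            # user value score
    ]

vec = extract_features(
    {"ranking_score": 0.7, "age_days": 12, "score": 5, "likes": 3,
     "replies": 1, "length": 48, "pictures": 2, "has_followup": True,
     "tag_count": 1},
    {"gender": 1, "purchasing_power": 3, "member_level": 2, "value_score": 80},
)
```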
Preferably, between step S1 and step S2 the method further comprises LDA (Latent Dirichlet Allocation, a document topic model) topic clustering, the LDA topic clustering comprising the following steps:
T1, converting each item of historical comment data into a historical feature vector;
T2, obtaining the features, and inputting the historical feature vectors and the features into an LDA model for topic clustering, to obtain the number of historical feature vectors under each category of the LDA model;
T3, judging one by one whether the number of historical feature vectors under each category is less than a preset value; if so, executing step T4, otherwise executing step T5;
T4, labeling the historical comment data corresponding to the historical feature vectors under the categories whose number of historical feature vectors is less than the preset value, to generate second intermediate data, where each item of second intermediate data comprises the historical comment data and a corresponding label, the label in the second intermediate data being spam comment;
T5, setting the historical comment data corresponding to the historical feature vectors under the categories whose number of historical feature vectors is greater than or equal to the preset value as historical comment data to be labeled.
Step S2 then becomes: labeling each item of historical comment data to be labeled, to generate the first intermediate data, where each item of first intermediate data comprises the historical comment data to be labeled and a corresponding label, the label being spam comment or valuable comment.
Step S3 then becomes: converting each item of first intermediate data and each item of second intermediate data into a historical comment sequence.
In this solution, LDA topic clustering is performed right after the historical comment data is obtained: the historical comment data under sparsely populated comment categories is labeled spam directly, while the remaining historical comment data is labeled manually as spam or valuable before being fed into the recurrent neural network for training.
In this solution, supervised and unsupervised learning are combined at the machine-learning level: LDA topic clustering first provides heuristic spam labels for the deep-learning training set, the remaining collected data is then labeled with spam labels, and finally the labeled historical comment data is input into the recurrent neural network for model training and parameter tuning.
In this solution, clustering all the historical comment data with LDA topic clustering first allows a portion of the data to be determined as spam directly, reducing the labeling workload while also improving the accuracy of spam-comment recognition.
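The threshold logic of steps T3–T5 can be sketched in pure Python. The LDA step itself is replaced here by precomputed topic assignments, and the preset value and toy data are assumptions for illustration.

```python
from collections import Counter

def split_by_topic_size(comments, topic_of, preset=2):
    """T3-T5: comments in topics smaller than `preset` are labeled spam
    outright; the rest are set aside for manual labeling."""
    sizes = Counter(topic_of)                 # vectors per topic (T2 output)
    auto_spam, to_label = [], []
    for comment, topic in zip(comments, topic_of):
        if sizes[topic] < preset:             # T3: sparsely populated topic?
            auto_spam.append({"comment": comment, "label": "spam"})   # T4
        else:
            to_label.append(comment)          # T5: awaits manual labeling
    return auto_spam, to_label

comments = ["nice product", "good value", "buy pills here"]
topics = [0, 0, 7]                            # topic 7 holds a single comment
auto_spam, to_label = split_by_topic_size(comments, topics)
```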
Preferably, between step S1 and step S2 the method further comprises the step of:
performing data cleaning on the historical comment data;
and in step S2, each item of historical comment data after data cleaning is labeled, to generate the first intermediate data.
In this solution, a data-cleaning step for the historical comment data is also included, which may specifically comprise missing-value processing, outlier processing and heuristic processing of the comment data. As an example of outlier processing: comments normally carry at most tens of pictures, so a comment that occasionally carries 10,000 pictures is treated as outlier data and discarded. As an example of heuristic processing: a comment that contains no language at all, only punctuation marks and digits, is regarded as a spam comment; it can be labeled spam directly and later fed to the recurrent neural network.
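The two cleaning rules can be sketched as below. The picture-count threshold is an illustrative reading of the 10,000-picture example; the patent does not fix an exact cutoff.

```python
import string

MAX_PICTURES = 100   # assumption: normal comments carry at most tens of pictures

def clean(comments):
    """Drop picture-count outliers; auto-label punctuation/digit-only comments."""
    kept, auto_spam = [], []
    for c in comments:
        if c["pictures"] > MAX_PICTURES:      # outlier, e.g. 10,000 pictures
            continue                          # discard; this item is not used
        text = c["text"]
        if text and all(ch in string.punctuation + string.digits + " "
                        for ch in text):
            auto_spam.append(c)               # punctuation and digits only
        else:
            kept.append(c)
    return kept, auto_spam

comments = [
    {"text": "works great", "pictures": 3},
    {"text": "!!! 123 ...", "pictures": 0},
    {"text": "photo dump", "pictures": 10000},
]
kept, auto_spam = clean(comments)
```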
Preferably, step S3 comprises the following steps:
S31, computing the vector of each word in the first intermediate data using word2vec (a tool that represents words as real-valued vectors);
S32, averaging over all the words comprised in the first intermediate data to generate the historical comment sequence.
In this solution, the first intermediate data is passed through word2vec to produce word vectors, which are then averaged to generate the historical comment sequence; a sentence made of text is thus converted into a mathematical vector that is used in the subsequent steps.
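Steps S31–S32 amount to looking up one vector per word and averaging. In this sketch a toy embedding table stands in for the word2vec output; real vectors would be trained with word2vec.

```python
import numpy as np

embeddings = {                      # hypothetical word2vec output (2-d for clarity)
    "great": np.array([1.0, 0.0]),
    "phone": np.array([0.0, 1.0]),
    "spam":  np.array([-1.0, -1.0]),
}

def comment_vector(tokens, emb):
    """S32: average the word vectors of a comment into one sequence vector."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(2)

seq = comment_vector(["great", "phone"], embeddings)
```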
The invention also provides a model generation system, characterized by comprising a data acquisition module, a first labeling module, a first data conversion module and a model training module;
the data acquisition module is used to obtain historical comment data;
the first labeling module is used to label each item of historical comment data to generate first intermediate data, where each item of first intermediate data comprises the historical comment data and a corresponding label, the label being spam comment or valuable comment;
the first data conversion module is used to convert each item of first intermediate data into a historical comment sequence;
the model training module is used to obtain features and input the historical comment sequences and the features into a recurrent neural network for model training, to generate a classification model for spam comments.
Preferably, the model training module further comprises a core-parameter tuning module; the core-parameter tuning module is used to tune core parameters comprising batch_size, num_steps, vocab_size, hidden_units and learning_rate.
Preferably, the model generation system further comprises a feature extraction module, the feature extraction module being used to generate the features.
Preferably, the model generation system further comprises an LDA topic clustering module, the LDA topic clustering module comprising a second data conversion module, a clustering execution module, a judgment module, a second labeling module and a data setting module;
the data acquisition module is also used to call the second data conversion module after obtaining the historical comment data;
the second data conversion module is used to convert each item of historical comment data into a historical feature vector and call the clustering execution module;
the clustering execution module is used to obtain the features, input the historical feature vectors and the features into an LDA model for topic clustering to obtain the number of historical feature vectors under each category of the LDA model, and call the judgment module;
the judgment module is used to judge one by one whether the number of historical feature vectors under each category is less than a preset value, calling the second labeling module if so and the data setting module otherwise;
the second labeling module is used to label the historical comment data corresponding to the historical feature vectors under the categories whose number of historical feature vectors is less than the preset value, to generate second intermediate data, where each item of second intermediate data comprises the historical comment data and a corresponding label, the label in the second intermediate data being spam comment;
the data setting module is used to set the historical comment data corresponding to the historical feature vectors under the categories whose number of historical feature vectors is greater than or equal to the preset value as historical comment data to be labeled;
the first labeling module is used to label each item of historical comment data to be labeled, to generate the first intermediate data, where each item of first intermediate data comprises the historical comment data to be labeled and a corresponding label, the label being spam comment or valuable comment;
the first data conversion module is used to convert each item of first intermediate data and each item of second intermediate data into a historical comment sequence.
Preferably, the model generation system further comprises a data cleaning module;
the data acquisition module is also used to call the data cleaning module after obtaining the historical comment data;
the data cleaning module is used to perform data cleaning on the historical comment data;
the first labeling module is used to label each item of historical comment data after data cleaning, to generate the first intermediate data.
Preferably, the first data conversion module comprises a word-vector generation module and a comment-sequence generation module;
the word-vector generation module is used to compute the vector of each word in the first intermediate data using word2vec;
the comment-sequence generation module is used to average over all the words comprised in the first intermediate data to generate the historical comment sequence.
The invention also provides a model generation device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor implements the aforementioned model generation method when executing the program.
The invention also provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the aforementioned model generation method.
The invention also provides a comment recognition method, characterized by comprising the following steps:
L1, obtaining comment data to be recognized;
L2, converting the comment data to be recognized into a comment sequence to be recognized;
L3, inputting the comment sequence to be recognized into the classification model generated in step S4 of the aforementioned model generation method;
L4, the classification model judging whether the comment data corresponding to the comment sequence to be recognized is a spam comment.
In this solution, the comment data to be recognized is new comment data; once input into the classification model generated from the historical comment data, it is directly recognized as a spam comment or a valuable comment. The comment recognition method of this solution recognizes spam automatically, reducing recognition cost and improving the coverage and accuracy of spam-comment recognition. Moreover, with spam comments effectively recognized, only comments with reference value are shown to users, further improving the user experience.
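The recognition flow L1–L4 can be sketched as below. The keyword rule is only a stub standing in for the trained LSTM classification model; the function names are assumptions.

```python
def to_sequence(comment):
    """L2: convert the comment to be recognized into a sequence."""
    return comment.lower().split()

def classification_model(sequence):
    """Stub for the model produced in step S4; returns True for spam."""
    spam_markers = {"http", "link", "buy"}      # toy decision rule, not the LSTM
    return any(tok in spam_markers for tok in sequence)

def is_spam(comment):
    """L3-L4: feed the sequence to the model and return its judgment."""
    return classification_model(to_sequence(comment))

verdict = is_spam("Click this link for free phones")
```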
The invention also provides a comment recognition system, characterized by comprising a to-be-recognized data acquisition module, a sequence generation module, an input module and the aforementioned model generation system;
the to-be-recognized data acquisition module is used to obtain comment data to be recognized;
the sequence generation module is used to convert the comment data to be recognized into a comment sequence to be recognized;
the input module is used to input the comment sequence to be recognized into the classification model;
the classification model is used to judge whether the comment data corresponding to the comment sequence to be recognized is a spam comment.
The invention also provides a comment recognition device comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor implements the aforementioned comment recognition method when executing the program.
The invention also provides a computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the aforementioned comment recognition method.
The positive effect of the invention is that the model generation and comment recognition methods, systems, devices and storage media provided by the invention apply recurrent neural networks to spam-comment recognition: a classification model suited to spam comments is trained from historical comment data, and new comments to be recognized are then fed to the classification model to determine whether they are spam. The invention recognizes comments to be recognized automatically, reducing recognition cost and improving the coverage and accuracy of spam-comment recognition. Moreover, with spam comments effectively recognized, only comments with reference value are shown to users, further improving the user experience.
Brief description of the drawings
Fig. 1 is a flowchart of the model generation method of embodiment 1 of the invention.
Fig. 2 is a flowchart of step 108 in embodiment 1 of the invention.
Fig. 3 is a flowchart of distributed model training in embodiment 1 of the invention.
Fig. 4 is a module diagram of the model generation system of embodiment 2 of the invention.
Fig. 5 is a hardware structure diagram of the model generation device of embodiment 4 of the invention.
Fig. 6 is a flowchart of the comment recognition method of embodiment 5 of the invention.
Fig. 7 is a module diagram of the comment recognition system of embodiment 6 of the invention.
Specific embodiments
The invention is further illustrated below by way of embodiments, but the invention is not thereby limited to the scope of the described embodiments.
Embodiment 1
As shown in Fig. 1, the model generation method provided in this embodiment comprises the following steps:
Step 101, obtaining historical comment data;
Step 102, performing data cleaning on the historical comment data;
Step 103, performing feature extraction on the cleaned historical comment data to obtain features;
Step 104, converting each item of cleaned historical comment data into a historical feature vector;
Step 105, inputting the historical feature vectors and the features into an LDA model for topic clustering, to obtain the number of historical feature vectors under each category of the LDA model;
Step 106, judging one by one whether the number of historical feature vectors under each category is less than a preset value; if so, labeling the historical comment data corresponding to the historical feature vectors under that category to generate second intermediate data, where each item of second intermediate data comprises the historical comment data and a corresponding label, the label in the second intermediate data being spam comment; otherwise, setting the historical comment data corresponding to the historical feature vectors under that category as historical comment data to be labeled;
Step 107, labeling each item of historical comment data to be labeled, to generate first intermediate data, where each item of first intermediate data comprises the historical comment data to be labeled and a corresponding label, the label being spam comment or valuable comment;
Step 108, converting each item of first intermediate data and each item of second intermediate data into a historical comment sequence;
Step 109, inputting the historical comment sequences and the features into an LSTM for model training, tuning the core parameters with TensorFlow, to generate the classification model for spam comments.
The process of step 108 is shown in Fig. 2 and comprises the following steps:
Step 1081, computing, with word2vec, on the first intermediate data and the second intermediate data respectively, to obtain the vector of each word;
Step 1082, averaging over all the words comprised in the first intermediate data and the second intermediate data respectively, to generate the historical comment sequences.
In this embodiment, the core parameters comprise batch_size, num_steps, vocab_size, hidden_units and learning_rate.
Here batch_size is the number of samples per batch of gradient-descent iteration in the LSTM, i.e. each training step takes batch_size samples from the training set; batch_size is 64 in this embodiment.
num_steps is the number of deep-learning steps in the LSTM; its value is 100 in this embodiment.
vocab_size is the size of the recurrent network's word sliding window in the LSTM; its value is 2 in this embodiment.
hidden_units is the number of deep-learning hidden-layer units in the LSTM; its value is 8 in this embodiment.
learning_rate is the learning rate of the deep neural network in the LSTM; its value is 0.001 in this embodiment.
This embodiment further involves five LSTM parameters, data_dir, ps_hosts, worker_hosts, job_name and tf.device, which are specified directly in this embodiment. Here data_dir is the training-data path, divided in this embodiment into a training set, a validation set and a test set; ps_hosts denotes the machines of the distributed TensorFlow cluster responsible for receiving parameters; worker_hosts denotes the machines of the distributed TensorFlow cluster responsible for computing the training model; job_name denotes the name of the application task started for model training; and tf.device specifies whether the GPU or the CPU (Central Processing Unit) is used during training.
In this embodiment, the features comprise comment features of the item and user features. The comment features of the item comprise the commenter's comment ranking score, the distance of the comment creation time from the current time, the comment score (good, medium or poor), the number of likes on the comment, the number of replies to the comment, the comment length, the number of pictures in the comment, whether the comment has a follow-up comment, and the number of item tags included in the comment. The user features comprise user gender, user purchasing-power level, user membership-level information and user value score, where the user membership-level information indicates, for example, whether the user is a JD PLUS member (a membership tier of JD.com).
In this embodiment, data cleansing is first performed on the historical comment data. Data cleansing specifically includes missing-value processing, outlier processing, and heuristic processing of the comment data. As an example of outlier processing: comment picture counts are normally within the range of a few dozen; if a comment occasionally has 10,000 pictures, that comment's picture data is considered an outlier and is discarded, i.e., the comment is not used. As an example of heuristic processing: if a comment consists entirely of punctuation marks and digits, without any actual language, it is considered a spam comment; it may also be directly labeled as spam and subsequently fed to the recurrent neural network.
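The two cleansing rules just described — discarding outlier picture counts and flagging punctuation/digit-only text as spam — might be sketched as follows. The 1,000-picture cutoff and the regular expression are illustrative assumptions; the embodiment only gives the examples of "a few dozen" being normal and 10,000 being abnormal.

```python
import re

PICTURE_OUTLIER_THRESHOLD = 1000  # assumed cutoff; "a few dozen" is normal

def clean_picture_count(count):
    """Outlier processing: drop an abnormal picture count (e.g. 10,000)
    by returning None so the field is not used."""
    return None if count > PICTURE_OUTLIER_THRESHOLD else count

def is_heuristic_spam(text):
    """Heuristic processing: a comment made up entirely of punctuation
    and digits, with no actual language, is treated as spam."""
    return bool(re.fullmatch(r"[\W\d_]+", text))

print(clean_picture_count(10000))                       # None -> field discarded
print(is_heuristic_spam("!!! 123 ???"))                 # True  -> spam
print(is_heuristic_spam("Great phone, fast delivery"))  # False -> kept
```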
In this embodiment, the cleansed historical comment data is converted into historical feature vectors and then clustered by LDA topic modeling. Specifically, under the resulting topic distribution, historical comment data falling into categories with very few comments is directly labeled as spam, while the other historical comment data is set as historical comment data to be labeled; labels are then applied to determine whether each such comment is spam or a valuable comment, after which the data is input into the recurrent neural network for training. In this way, a portion of the historical comment data can be determined to be spam directly, reducing the labeling workload while also improving the accuracy of spam recognition.
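The cluster-size heuristic above — directly labeling comments in sparsely populated LDA topics as spam and leaving the rest for manual labeling — can be sketched independently of any particular LDA library. The preset value of 5 is an illustrative assumption; the embodiment only says "very few comments".

```python
from collections import Counter

PRESET_VALUE = 5  # illustrative threshold for "very few comments"

def split_by_cluster_size(cluster_ids):
    """Given the LDA topic assigned to each comment, return
    (auto_spam_indices, to_be_labeled_indices). Comments in topics
    with fewer than PRESET_VALUE members are labeled spam directly."""
    sizes = Counter(cluster_ids)
    auto_spam, to_label = [], []
    for idx, cid in enumerate(cluster_ids):
        (auto_spam if sizes[cid] < PRESET_VALUE else to_label).append(idx)
    return auto_spam, to_label

clusters = [0, 0, 0, 0, 0, 1, 1, 2]  # topics 1 and 2 are sparsely populated
spam, pending = split_by_cluster_size(clusters)
print(spam)     # indices labeled spam automatically
print(pending)  # indices sent for manual labeling
```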
In this embodiment, after the historical comment data has successively undergone data cleansing, LDA topic clustering, labeling, and setting, the historical comment data set as to-be-labeled still needs to be labeled manually to determine whether each such comment is spam or a valuable comment. Data conversion is then performed on the labeled first intermediate data, converting the text data into vectors that the LSTM can process. Specifically, word2vec is first applied to the first intermediate data and the second intermediate data to compute a vector for each word; all the word vectors within a single user comment are then averaged to produce the historical comment sequence, i.e., a single vector, thereby converting a sentence composed of text into a mathematical vector. The features and the historical comment sequences are then input to the LSTM for training, with the core parameters tuned using TensorFlow, ultimately producing the classification and recognition model for spam comments. This model is subsequently used to identify whether new comment data is spam.
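The text-to-vector conversion just described — word2vec per word, then averaging over the comment — can be sketched with toy embeddings. The 3-dimensional vectors below stand in for real word2vec output, and the zero-vector fallback for unknown words is an assumption.

```python
# Toy word embeddings standing in for word2vec output (3-d for brevity).
EMBEDDINGS = {
    "great":   [0.9, 0.1, 0.0],
    "phone":   [0.2, 0.8, 0.1],
    "rubbish": [0.0, 0.1, 0.9],
}
UNKNOWN = [0.0, 0.0, 0.0]  # assumed fallback for out-of-vocabulary words

def comment_to_vector(text):
    """Average the word vectors of every word in one comment, producing
    the single comment vector that the LSTM consumes."""
    words = text.lower().split()
    vecs = [EMBEDDINGS.get(w, UNKNOWN) for w in words]
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(3)]

v = comment_to_vector("great phone")
print(v)  # element-wise mean of the two word vectors
```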
In this embodiment, model training is carried out using distributed TensorFlow. As shown in Figure 3, multiple devices read the parameter values simultaneously, and after the back-propagation algorithm completes, the parameter values are updated synchronously: no single device updates the parameters independently; instead, all devices wait until every device has finished back-propagation, and only then are the parameters updated uniformly. In each round of iteration, each device randomly obtains a small portion of the data and computes the gradients of its training parameters; once all devices have completed the back-propagation computation, the average of the parameter gradients across the devices is computed and the parameters are finally updated.
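The synchronous update scheme described here — every device computes a gradient on its own data shard, the gradients are averaged, and a single update is applied — reduces to the following sketch, with plain Python standing in for distributed TensorFlow.

```python
def synchronized_update(param, device_gradients, learning_rate=0.001):
    """One round of synchronous data-parallel training: wait for every
    device's gradient, average them, then apply a single update."""
    avg_grad = sum(device_gradients) / len(device_gradients)
    return param - learning_rate * avg_grad

# Three devices each computed a gradient on its own mini-batch.
new_param = synchronized_update(1.0, [0.3, 0.6, 0.9])
print(new_param)  # 1.0 - 0.001 * mean(gradients)
```

The key property is that all devices see the same parameter value in every round; asynchronous schemes trade this consistency for throughput.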
In this embodiment, distributed TensorFlow with multiple GPUs trains the model in parallel, processing large volumes of historical comment sequences at higher speed and thereby improving the user experience.
In this embodiment, a recurrent neural network is applied to spam-comment recognition, the recurrent neural network being an LSTM. Historical comment data is used to train a classification and recognition model suitable for spam comments, providing a decision basis for identifying whether subsequent new comment data is spam. With this model, whether new comment data is spam can be recognized automatically, no longer relying on manual participation, thereby reducing the labor cost of spam recognition.
In this embodiment, in terms of machine-learning algorithms, supervised and unsupervised learning are combined: LDA topic clustering first provides heuristic spam-label recognition for the deep-learning training set; spam labels are then applied to the remaining data; finally, the labeled historical comment data is input to the recurrent neural network for model training and parameter tuning.
Embodiment 2
As shown in Figure 4, the model generation system of this embodiment includes a data acquisition module 1, a data cleansing module 2, a feature extraction module 3, an LDA topic clustering module 4, a first label annotation module 5, a first data conversion module 6, and a model training module 7.
The model training module 7 further includes a core parameter tuning module 701.
The LDA topic clustering module 4 includes a second data conversion module 401, a clustering execution module 402, a judgment module 403, a second label annotation module 404, and a data setting module 405.
The first data conversion module 6 includes a word vector generation module 601 and an evaluation sequence generation module 602.
The data acquisition module 1 is used to obtain historical comment data and to call the data cleansing module 2 after the historical comment data has been obtained.
The data cleansing module 2 is used to perform data cleansing on the historical comment data.
The feature extraction module 3 is used to perform feature extraction on the cleansed historical comment data to obtain the features, and to call the second data conversion module 401.
The second data conversion module 401 is used to convert each item of cleansed historical comment data into a historical feature vector, and to call the clustering execution module 402.
The clustering execution module 402 is used to input the historical feature vectors and the features into the LDA model for topic clustering, to obtain the number of historical feature vectors under each category of the LDA model, and to call the judgment module 403.
The judgment module 403 is used to judge, one by one, whether the number of historical feature vectors under each category is less than a preset value; if so, the second label annotation module 404 is called; otherwise, the data setting module 405 is called. The second label annotation module 404 is used to label the historical comment data corresponding to the historical feature vectors under categories where the number of historical feature vectors is less than the preset value, to generate second intermediate data; each item of second intermediate data includes the historical comment data and the corresponding label, the label in the second intermediate data being spam comment. The data setting module 405 is used to set, as historical comment data to be labeled, the historical comment data corresponding to the historical feature vectors under categories where the number of historical feature vectors is greater than or equal to the preset value. The judgment module 403 calls the first label annotation module 5 after processing the historical feature vectors under all categories.
The first label annotation module 5 is used to label each item of historical comment data to be labeled, to generate first intermediate data; each item of first intermediate data includes the historical comment data to be labeled and the corresponding label, the label being spam comment or valuable comment. The first label annotation module 5 calls the first data conversion module 6 after labeling is complete.
The first data conversion module 6 is used to convert each item of first intermediate data and each item of second intermediate data into a historical comment sequence and to call the model training module 7.
The model training module 7 is used to input the historical comment sequences and the features into the recurrent neural network for model training; the core parameter tuning module 701 tunes the core parameters using TensorFlow, to generate the classification and recognition model for spam comments.
The word vector generation module 601 is used to compute the vector of each word in the first intermediate data and the second intermediate data, respectively, using word2vec, and to call the evaluation sequence generation module 602.
The evaluation sequence generation module 602 is used to average all the words included in the first intermediate data and the second intermediate data, respectively, to generate the historical comment sequences.
In this embodiment, the recurrent neural network is an LSTM.
In this embodiment, the core parameters include batch_size, num_steps, vocab_size, hidden_units, and learning_rate. The batch_size value is 64, the num_steps value is 100, the vocab_size value is 2, the hidden_units value is 8, and the learning_rate value is 0.001.
In this embodiment, the features include comment features of the commodity and user features. The comment features of the commodity include the reviewer's comment ranking score, the distance of the comment creation time from the current time, the comment score, the number of likes the comment received, the number of replies to the comment, the comment length, the number of pictures in the comment, whether the comment has a follow-up comment, and the number of commodity tags included in the comment. The user features include the user's gender, the user's purchasing-power grade, the user's membership-level information, and the user's value points.
The model generation system provided in this embodiment applies a recurrent neural network to spam-comment recognition, the recurrent neural network being an LSTM. Historical comment data can be used to train a classification and recognition model suitable for spam comments, providing a decision basis for identifying whether subsequent new comment data is spam. With this model, whether new comment data is spam can be recognized automatically, no longer relying on manual participation, thereby reducing the labor cost of spam recognition.
Embodiment 3
Fig. 5 is a structural schematic diagram of a model generation device provided by Embodiment 3 of the present invention. Fig. 5 shows a block diagram of an exemplary model generation device 50 suitable for implementing embodiments of the present invention. The model generation device 50 shown in Fig. 5 is only an example and should not impose any restrictions on the functions or scope of use of the embodiments of the present invention.
As shown in Fig. 5, the model generation device 50 may take the form of a general-purpose computing device, such as a server device. The components of the model generation device 50 may include, but are not limited to: at least one processor 51, at least one memory 52, and a bus 53 connecting the different system components (including the memory 52 and the processor 51).
The bus 53 includes a data bus, an address bus, and a control bus.
The memory 52 may include volatile memory, such as random access memory (RAM) 521 and/or cache memory 522, and may further include read-only memory (ROM) 523.
The memory 52 may also include a program/utility 525 having a set of (at least one) program modules 524, such program modules 524 including but not limited to: an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination thereof, may include an implementation of a network environment.
The processor 51 executes the computer program stored in the memory 52, thereby performing various functional applications and data processing, such as the model generation method provided by Embodiment 1 of the present invention.
The model generation device 50 may also communicate with one or more external devices 54 (such as a keyboard, a pointing device, etc.). This communication may take place through an input/output (I/O) interface 55. Moreover, the model generation device 50 may also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 56. As shown, the network adapter 56 communicates with the other modules of the model generation device 50 through the bus 53. It should be understood that, although not shown in the drawings, other hardware and/or software modules may be used in conjunction with the model generation device 50, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, and data backup storage systems.
It should be noted that although several units/modules or sub-units/modules of the model generation device are mentioned in the detailed description above, such division is merely exemplary and not mandatory. In fact, according to embodiments of the present application, the features and functions of two or more of the units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided and embodied by multiple units/modules.
Embodiment 4
This embodiment provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the model generation method provided by Embodiment 1 are realized.
Embodiment 5
As shown in Fig. 6, the comment recognition method of this embodiment comprises the following steps:
Step M1: obtaining comment data to be recognized;
Step M2: converting the comment data to be recognized into a comment sequence to be recognized;
Step M3: inputting the comment sequence to be recognized into the classification and recognition model generated in step 109 of the method described in Embodiment 1;
Step M4: the classification and recognition model judging whether the comment data to be recognized corresponding to the comment sequence to be recognized is a spam comment.
In this embodiment, the comment data to be recognized is new comment data; after this data is input into the classification and recognition model that Embodiment 1 generated from the historical comment data, it can be directly identified as a spam comment or a valuable comment.
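The recognition flow of steps M1–M4 — convert a new comment to a vector, feed it through the trained model, obtain a spam/valuable decision — can be sketched end to end with a trivial stand-in classifier. The threshold rule and toy embeddings below are purely illustrative; in the actual method the classifier is the trained LSTM.

```python
def to_sequence(text, embeddings, dim=3):
    """Step M2: average word vectors into one comment vector."""
    vecs = [embeddings.get(w, [0.0] * dim) for w in text.lower().split()]
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def classify(vector, threshold=0.5):
    """Steps M3-M4: a stand-in for the trained LSTM classifier.
    Here 'spam' is declared when the last component dominates."""
    return "spam" if vector[-1] > threshold else "valuable"

EMB = {"great": [0.9, 0.1, 0.0], "rubbish": [0.0, 0.1, 0.9]}
print(classify(to_sequence("rubbish rubbish", EMB)))  # spam
print(classify(to_sequence("great", EMB)))            # valuable
```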
The comment recognition method provided in this embodiment can automatically identify whether comment data to be recognized is a spam comment, reducing the recognition cost and improving the coverage and accuracy of spam-comment recognition. In addition, with spam comments effectively recognized, the comments shown to users all have reference value, further improving the user experience.
Embodiment 6
As shown in Fig. 7, the comment recognition system of this embodiment includes a to-be-recognized data acquisition module 8, a sequence generation module 9, an input module 10, and the model generation system 11 described in Embodiment 2.
The to-be-recognized data acquisition module 8 is used to obtain comment data to be recognized.
The sequence generation module 9 is used to convert the comment data to be recognized into a comment sequence to be recognized.
The input module 10 is used to input the comment sequence to be recognized into the classification and recognition model.
The classification and recognition model is used to judge whether the comment data to be recognized corresponding to the comment sequence to be recognized is a spam comment.
The comment recognition system provided in this embodiment can automatically identify whether comment data to be recognized is a spam comment, reducing the recognition cost and improving the coverage and accuracy of spam-comment recognition. In addition, with spam comments effectively recognized, the comments shown to users all have reference value, further improving the user experience.
Embodiment 7
This embodiment provides a comment recognition device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when executing the program, the processor realizes the comment recognition method provided by Embodiment 5.
Embodiment 8
This embodiment provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the comment recognition method provided by Embodiment 5 are realized.
Although specific embodiments of the present invention have been described above, it will be appreciated by those skilled in the art that these are examples only, and the protection scope of the present invention is defined by the appended claims. Those skilled in the art may make various changes and modifications to these embodiments without departing from the principle and substance of the present invention, and all such changes and modifications fall within the protection scope of the present invention.
Claims (22)
1. A model generation method, characterized by comprising the following steps:
S1: obtaining historical comment data;
S2: labeling each item of historical comment data to generate first intermediate data, each item of first intermediate data including the historical comment data and a corresponding label, the label being spam comment or valuable comment;
S3: converting each item of first intermediate data into a historical comment sequence;
S4: obtaining features, and inputting the historical comment sequences and the features into a recurrent neural network for model training, to generate a classification and recognition model for spam comments.
2. The model generation method of claim 1, wherein the recurrent neural network is an LSTM.
3. The model generation method of claim 1, wherein performing model training in step S4 includes a step of tuning core parameters, the core parameters including batch_size, num_steps, vocab_size, hidden_units, and learning_rate.
4. The model generation method of claim 3, wherein batch_size is 64, num_steps is 100, vocab_size is 2, hidden_units is 8, and learning_rate is 0.001.
5. The model generation method of claim 3, wherein in step S4 the core parameters are tuned using TensorFlow.
6. The model generation method of claim 1, wherein the model generation method further includes extracting features to generate the features.
7. The model generation method of claim 6, wherein the features include comment features of a commodity and user features;
the comment features of the commodity include at least one of the following features:
the reviewer's comment ranking score, the distance of the comment creation time from the current time, the comment score, the number of likes the comment received, the number of replies to the comment, the comment length, the number of pictures in the comment, whether the comment has a follow-up comment, and the number of commodity tags included in the comment;
the user features include at least one of the following features:
the user's gender, the user's purchasing-power grade, the user's membership-level information, and the user's value points.
8. The model generation method of claim 7, wherein LDA topic clustering is further included between step S1 and step S2, the LDA topic clustering comprising the following steps:
T1: converting each item of historical comment data into a historical feature vector;
T2: obtaining the features, and inputting the historical feature vectors and the features into an LDA model for topic clustering, to obtain the number of historical feature vectors under each category of the LDA model;
T3: judging, one by one, whether the number of historical feature vectors under each category is less than a preset value; if so, performing step T4; if not, performing step T5;
T4: labeling the historical comment data corresponding to the historical feature vectors under categories where the number of historical feature vectors is less than the preset value, to generate second intermediate data, each item of second intermediate data including the historical comment data and a corresponding label, the label in the second intermediate data being spam comment;
T5: setting, as historical comment data to be labeled, the historical comment data corresponding to the historical feature vectors under categories where the number of historical feature vectors is greater than or equal to the preset value;
step S2 being:
labeling each item of historical comment data to be labeled to generate the first intermediate data, each item of first intermediate data including the historical comment data to be labeled and a corresponding label, the label being spam comment or valuable comment;
step S3 being: converting each item of first intermediate data and each item of second intermediate data into historical comment sequences.
9. The model generation method of claim 1, wherein the following step is further included between step S1 and step S2:
performing data cleansing on the historical comment data;
and wherein, in step S2, each item of historical comment data after data cleansing is labeled to generate the first intermediate data.
10. The model generation method of claim 1, wherein step S3 comprises the following steps:
S31: computing the vector of each word in the first intermediate data using word2vec;
S32: averaging all the words included in the first intermediate data to generate the historical comment sequence.
11. A model generation system, characterized by including a data acquisition module, a first label annotation module, a first data conversion module, and a model training module;
the data acquisition module being used to obtain historical comment data;
the first label annotation module being used to label each item of historical comment data to generate first intermediate data, each item of first intermediate data including the historical comment data and a corresponding label, the label being spam comment or valuable comment;
the first data conversion module being used to convert each item of first intermediate data into a historical comment sequence;
the model training module being used to obtain features and to input the historical comment sequences and the features into a recurrent neural network for model training, to generate a classification and recognition model for spam comments.
12. The model generation system of claim 11, wherein the model training module further includes a core parameter tuning module;
the core parameter tuning module being used to tune core parameters, the core parameters including batch_size, num_steps, vocab_size, hidden_units, and learning_rate.
13. The model generation system of claim 11, wherein the model generation system further includes a feature extraction module, the feature extraction module being used to generate the features.
14. The model generation system of claim 13, wherein the model generation system further includes an LDA topic clustering module, the LDA topic clustering module including a second data conversion module, a clustering execution module, a judgment module, a second label annotation module, and a data setting module;
the data acquisition module also being used to call the second data conversion module after obtaining the historical comment data;
the second data conversion module being used to convert each item of historical comment data into a historical feature vector and to call the clustering execution module;
the clustering execution module being used to obtain the features, to input the historical feature vectors and the features into an LDA model for topic clustering, to obtain the number of historical feature vectors under each category of the LDA model, and to call the judgment module;
the judgment module being used to judge, one by one, whether the number of historical feature vectors under each category is less than a preset value; if so, the second label annotation module is called; if not, the data setting module is called;
the second label annotation module being used to label the historical comment data corresponding to the historical feature vectors under categories where the number of historical feature vectors is less than the preset value, to generate second intermediate data, each item of second intermediate data including the historical comment data and a corresponding label, the label in the second intermediate data being spam comment;
the data setting module being used to set, as historical comment data to be labeled, the historical comment data corresponding to the historical feature vectors under categories where the number of historical feature vectors is greater than or equal to the preset value;
the first label annotation module being used to label each item of historical comment data to be labeled, to generate the first intermediate data, each item of first intermediate data including the historical comment data to be labeled and a corresponding label, the label being spam comment or valuable comment;
the first data conversion module being used to convert each item of first intermediate data and each item of second intermediate data into historical comment sequences.
15. The model generation system of claim 11, wherein the model generation system further includes a data cleansing module;
the data acquisition module also being used to call the data cleansing module after obtaining the historical comment data;
the data cleansing module being used to perform data cleansing on the historical comment data;
the first label annotation module being used to label each item of historical comment data after data cleansing, to generate the first intermediate data.
16. The model generation system of claim 11, wherein the first data conversion module includes a word vector generation module and an evaluation sequence generation module;
the word vector generation module being used to compute the vector of each word in the first intermediate data using word2vec;
the evaluation sequence generation module being used to average all the words included in the first intermediate data to generate the historical comment sequence.
17. A model generation device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, realizes the model generation method of any one of claims 1 to 10.
18. A computer-readable storage medium on which a computer program is stored, wherein, when executed by a processor, the program realizes the steps of the model generation method of any one of claims 1 to 10.
19. A comment recognition method, characterized by comprising the following steps:
L1: obtaining comment data to be recognized;
L2: converting the comment data to be recognized into a comment sequence to be recognized;
L3: inputting the comment sequence to be recognized into the classification and recognition model generated in step S4 of the model generation method of any one of claims 1 to 10;
L4: the classification and recognition model judging whether the comment data to be recognized corresponding to the comment sequence to be recognized is a spam comment.
20. A comment recognition system, characterized by including a to-be-recognized data acquisition module, a sequence generation module, an input module, and the model generation system of any one of claims 11 to 16;
the to-be-recognized data acquisition module being used to obtain comment data to be recognized;
the sequence generation module being used to convert the comment data to be recognized into a comment sequence to be recognized;
the input module being used to input the comment sequence to be recognized into the classification and recognition model;
the classification and recognition model being used to judge whether the comment data to be recognized corresponding to the comment sequence to be recognized is a spam comment.
21. A comment recognition device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, realizes the comment recognition method of claim 19.
22. A computer-readable storage medium on which a computer program is stored, wherein, when executed by a processor, the program realizes the steps of the comment recognition method of claim 19.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201711225988.1A CN110020147A (en) | 2017-11-29 | 2017-11-29 | Model generates, method for distinguishing, system, equipment and storage medium are known in comment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110020147A true CN110020147A (en) | 2019-07-16 |
Family
ID=67185904
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201711225988.1A Pending CN110020147A (en) | 2017-11-29 | 2017-11-29 | Model generates, method for distinguishing, system, equipment and storage medium are known in comment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020147A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079813A (en) * | 2019-12-10 | 2020-04-28 | 北京百度网讯科技有限公司 | Classification model calculation method and device based on model parallelism |
2017-11-29: application CN201711225988.1A filed in China (CN); published as CN110020147A (en); status: active, Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104239539A (en) * | 2013-09-22 | 2014-12-24 | 中科嘉速(北京)并行软件有限公司 | Microblog information filtering method based on multi-information fusion |
CN103996130A (en) * | 2014-04-29 | 2014-08-20 | 北京京东尚科信息技术有限公司 | Goods evaluation information filtering method and system |
US20160267377A1 (en) * | 2015-03-12 | 2016-09-15 | Staples, Inc. | Review Sentiment Analysis |
CN105354216A (en) * | 2015-09-28 | 2016-02-24 | 哈尔滨工业大学 | Chinese microblog topic information processing method |
CN106844349A (en) * | 2017-02-14 | 2017-06-13 | 广西师范大学 | Comment spam recognition method based on co-training |
CN107153642A (en) * | 2017-05-16 | 2017-09-12 | 华北电力大学 | Analysis method for recognizing the sentiment orientation of text comments based on a neural network |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111079813A (en) * | 2019-12-10 | 2020-04-28 | 北京百度网讯科技有限公司 | Classification model calculation method and device based on model parallelism |
CN111079813B (en) * | 2019-12-10 | 2023-07-07 | 北京百度网讯科技有限公司 | Classification model calculation method and device based on model parallelism |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110111139B (en) | Behavior prediction model generation method and device, electronic equipment and readable medium | |
CN110610193A (en) | Method and device for processing labeled data | |
CN110263979B (en) | Method and device for predicting sample label based on reinforcement learning model | |
JP6835204B2 (en) | Learning material recommendation method, learning material recommendation device and learning material recommendation program | |
CN109472305A (en) | Answer quality determination model training method, answer quality determination method and device | |
CN115002200B (en) | Message pushing method, device, equipment and storage medium based on user portrait | |
CN112182362A (en) | Method and device for training model for online click rate prediction and recommendation system | |
CN108932648A (en) | Method and apparatus for predicting item attribute data and training a model therefor | |
Isljamovic et al. | Predicting students’ academic performance using artificial neural network: a case study from faculty of organizational sciences | |
CN110458600A (en) | Portrait model training method, device, computer equipment and storage medium | |
CN113742069A (en) | Capacity prediction method and device based on artificial intelligence and storage medium | |
CN110020147A (en) | Comment recognition model generation and recognition method, system, device, and storage medium | |
CN113705159A (en) | Merchant name labeling method, device, equipment and storage medium | |
CN108460049A (en) | Method and system for determining an information category | |
CN115167965A (en) | Transaction progress bar processing method and device | |
CN107291722B (en) | Descriptor classification method and device | |
Falessi et al. | The effort savings from using NLP to classify equivalent requirements | |
CN113837220A (en) | Robot target identification method, system and equipment based on online continuous learning | |
Wei et al. | Software project schedule management using machine learning & data mining | |
CN112446360A (en) | Target behavior detection method and device and electronic equipment | |
CN110807179A (en) | User identification method, device, server and storage medium | |
KR102546328B1 (en) | Method, device and system for providing content information monitoring and content planning automation solution for online marketing | |
Madan et al. | Comparison of benchmarks for machine learning cloud infrastructures | |
Waqas | A simulation-based approach to test the performance of large-scale real time software systems | |
Sabnis et al. | UPreG: An Unsupervised Approach for Building the Concept Prerequisite Graph. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||
RJ01 | Rejection of invention patent application after publication ||
Application publication date: 20190716 |