CN110390018A - A social network comment generation method based on LSTM - Google Patents

A social network comment generation method based on LSTM

Info

Publication number
CN110390018A
CN110390018A (application CN201910680645.7A)
Authority
CN
China
Prior art keywords
text
word
comment
feature
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910680645.7A
Other languages
Chinese (zh)
Inventor
何慧
张伟哲
方滨兴
邰煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201910680645.7A
Publication of CN110390018A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

A social network comment generation method based on LSTM, belonging to the technical field of social network comment generation. The invention addresses the problems that the application scenarios of existing social network comment generation techniques are overly narrow and uniform, and that they cannot provide a material corpus for public-opinion guidance. The invention uses an NLG technique based on LSTM learning, encoding the appearance of sentence structure, the semantics and types of characters, and each individual character through the probabilistic relations between characters learned from data. The comment information to be expressed is fused at both the semantic and syntactic levels, and in a later stage, through methods such as specific-word replacement, vivid, fluent, and varied high-quality comment text nearly indistinguishable from real social network comments is generated. The invention provides a favorable material corpus for public-opinion guidance: by spreading more truthful and trustworthy speech, it restores a positive network environment. The invention can serve as a material corpus fed into existing public-opinion guidance systems to generate comments for specific domains of a social network.

Description

A social network comment generation method based on LSTM
Technical field
The present invention relates to a social network comment generation method based on LSTM, and belongs to the technical field of social network comment generation.
Background technique
Nowadays, online social network platforms have greatly facilitated netizens' lives and communication. People and events all over the world are closely connected through the network, user participation in network events is ever higher, and numerous social network comments are produced as a result. A comment represents a language and a voice and is a reflection of thought; its sentences are concise, its intent clear, and its structures varied, making it an ideal testbed for automatic text generation. By posting, a user can express his or her own view and position on a hot event, whether approval, neutrality, or disapproval, or spread a network rumor driven by interests. This patent focuses on the Twitter platform, collecting user-generated content (UGC) in five fields on that platform, namely politics, health, education, entertainment, and science and technology, and mainly completes the automatic generation of comment text on Twitter.
Automatic text generation is one of the research fields of artificial intelligence. Its main idea is to analyze the deep meaning of the information fed into the computer, plan the information to be generated with the computer's internal text planner, and then, through a text realizer, convert this meaning into a grammatical linguistic structure that is output in the form of comment text. Artificial intelligence is a current hot spot of scientific development; text generation technology has gradually attracted attention, and its applications in real life are extremely broad, strongly influencing human work and life. The existing document CN108256968A discloses an expert-comment generation method for commodities on e-commerce platforms. It proposes an expert-comment summarization and generation technique based on a sequence-to-sequence model, which extracts the important information from all user comments on a commodity and generates a summary passage describing the commodity's characteristics. Consumers can learn the pros and cons of a commodity from the generated expert comment and decide whether to buy; merchants can improve their commodities according to the generated expert comment. That method can extract important comments representative of product characteristics, provide merchants with good references for improving commodities, help merchants enhance the user experience and increase sales and revenue, provide purchase references for consumers and improve their shopping experience, and help e-commerce platforms attract more loyal users and expand their influence. That document, however, does not propose generating comments by deep learning.
Judging from the overall development of natural language text generation at home and abroad, existing natural language generation techniques have the following problems with respect to comment generation in specific domains of social networks.
(1) In natural language generation research there are already many fairly mature models. Natural language text generation mostly focuses on dialogue systems, machine translation, information retrieval, text classification, and automatic summarization; the target texts studied are mostly standardized text collections, published articles, or publicly available standard data sets. Research on social network comment text generation is rare.
(2) Network comments are produced in many ways: social networks (e.g., Facebook, Twitter, Weibo, RenRen), e-commerce (e.g., Amazon, Alibaba, Dangdang), mail services (e.g., Gmail, Yahoo, E-mail), and network forums (e.g., Tianya, NetEase, Douban). Current work mostly concerns comment text generation for e-commerce and mail service platforms, and no universal method exists for generating social network comment text.
(3) In existing models for comment text generation, the studied language patterns are fixed and uniform. Comments on the Twitter platform target hot events in different fields; events are sudden, thoughts are posted as they occur, mostly led by first impressions, so comments are highly random and diverse, and their patterns are hard to capture. Although comment generation on the Yelp social network site has achieved rich results, it only involves users' commenting patterns on public reviews, which follow a fairly standard structure: experience-led, without preamble, sticking to the theme, commenting on key points, mostly led by likes and dislikes, with fixed and uniform patterns. The application scenarios are on the whole overly stable, the rules are easy to capture, and such models are not suitable for comments on the Twitter platform.
Summary of the invention
The technical problem to be solved by the present invention:
The present invention addresses the problems that the application scenarios of existing social network comment generation techniques are overly narrow and uniform and cannot provide a material corpus for public-opinion guidance, and therefore proposes a social network comment generation method based on LSTM.
The technical solution adopted by the present invention to solve the above technical problem is:
A social network comment generation method based on LSTM, the method comprising:
Classifying comment texts into seven categories: subject-link-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-predicate-object structure, subject-predicate-object-complement structure, and imperative structure. For the different categories, different LSTM models are designed; a probability structure is obtained by training each LSTM model, and initial comments IR_i of the different categories are generated, where the subscript i indicates the seven categories, i = 1, 2, 3, ..., 7;
According to the characteristics of each category, corresponding text-processing strategies are formulated to correct the corresponding LSTM model, thereby generating high-quality comment text FR_i consistent with real social network comments;
Given a specific domain D under a social network W, the set of hot topics contained in the domain is T = {T_1, T_2, ..., T_n}. Select a topic T_i, and for topic T_i select a specific main post P. Crawl the comment text set under the main post P, denoted RR = {RR_1, RR_2, ..., RR_n}. Through classification, data of the different categories are filtered out, denoted FlR = {FlR_1, FlR_2, ..., FlR_n}, and are fed separately into LSTM models with different parameters, generating the initial comment sets of the respective categories, denoted IR = {IR_1, IR_2, ..., IR_n}. For the features of each category, different strategies are formulated to correct the deviation of IR_i, generating the final comment set, denoted FR = {FR_1, FR_2, ..., FR_n}, as shown in formulas (1) to (3), where i ∈ {1, 2, 3, ..., n} and n equals 7; the function C represents the text classification process based on the random forest model, the function h represents the text generation process based on LSTM, and the function z_j represents the deviation correction process. Deviation correction comprises three strategies: text replacement, text paraphrasing, and model customization;
C(RR) → FlR_i (1)
h(W, D, T, P, FlR_i) → IR_i (2)
z_j(IR_i) → FR_i, j ∈ {1, 2, 3} (3)
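Under stated assumptions, the composition of formulas (1) to (3) can be sketched as three chained functions. The functions below are illustrative stand-ins only: in the patent, C is a random forest classifier, h is an LSTM encoder-decoder, and z_j is one of three deviation-correction strategies; the crude punctuation-based bucketing and string tagging here are hypothetical simplifications.

```python
# Sketch of the C -> h -> z_j pipeline from formulas (1)-(3).
# All three stages are illustrative stubs, not the patent's models.

def classify(raw_comments):
    """C(RR) -> FlR_i: bucket raw comments by a crude structural cue."""
    buckets = {"interrogative": [], "exclamatory": [], "other": []}
    for c in raw_comments:
        if c.strip().endswith("?"):
            buckets["interrogative"].append(c)
        elif c.strip().endswith("!"):
            buckets["exclamatory"].append(c)
        else:
            buckets["other"].append(c)
    return buckets

def generate(filtered):
    """h(W, D, T, P, FlR_i) -> IR_i: stand-in for the per-category LSTM."""
    return {cat: [f"[generated:{cat}] {c}" for c in comments]
            for cat, comments in filtered.items()}

def correct(initial):
    """z_j(IR_i) -> FR_i: stand-in for deviation correction."""
    return {cat: [c.replace("[generated:", "[final:") for c in comments]
            for cat, comments in initial.items()}

raw = ["Is this true?", "What a day!", "I agree with this."]
final = correct(generate(classify(raw)))
print(final["interrogative"])  # -> ['[final:interrogative] Is this true?']
```

The point of the sketch is only the data flow: raw comments are partitioned by category, each partition feeds its own generator, and every generated comment then passes through a correction stage.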
Further, the process of classifying comment texts based on the random forest model (i.e., sentence-structure classification based on the random forest model) is:
First, a data set is created by a crawler and processed with word segmentation, part-of-speech tagging, and sentence segmentation;
Second, feature extraction and selection are performed on the text to obtain the feature vector representing the text, which is input into the random forest model; the text classification output is obtained by random forest training, and the comment texts are finally divided into the following seven classes: subject-link-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-predicate-object structure, subject-predicate-object-complement structure, and imperative structure;
In the above feature extraction, feature selection and feature extraction are used for dimensionality reduction; the selected features are denoted by identifiers, and the specific features extracted are as follows:
(1) Word embedding + tf-idf feature
Each word is vectorized by selecting the word embedding vector obtained by CBOW model training. The objective function of CBOW is shown in formula (4). For a word w_t, its context is Context(w_t) = {w_{t-b}, ..., w_{t-1}, w_{t+1}, ..., w_{t+b}}, where the constant b determines the contextual window size; the window size is b = 4, and the dimension of the final vectorized word is set to 120. Meanwhile, the tf-idf feature selection method is introduced to weigh each word embedding vector, yielding the word embedding + tf-idf feature, named Wetfidf;
(2) WfreMatrix (word frequency matrix) feature
The WfreMatrix feature represents the frequency matrix: rows indicate article numbers, columns indicate the words appearing in the comment texts, and each matrix element is the frequency with which a word occurs; word frequency statistics are computed through the interface provided by sklearn;
(3) Pos (part of speech) feature
Part-of-speech tagging of the comment texts is performed with the pos_tag_sents function in the NLTK toolkit;
(4) Key feature
The Key feature represents the keyword feature; a keyword is a word that can represent a category; Key is computed as formula (5);
(5) Index feature
The Index feature represents the word-position-sequence feature;
(6) Punc feature
The Punc feature represents the punctuation feature; it is extracted by counting the position sequence of a fixed punctuation sequence.
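The tf-idf weighting used in feature (1) can be illustrated with a small self-contained computation. The patent combines CBOW embeddings with tf-idf via sklearn; the sketch below instead computes the tf-idf term weights alone, from scratch, on toy documents, using one common (assumed) variant of the formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights per document.

    tf  = count of term in doc / doc length
    idf = log(N / df), df = number of docs containing the term
    (a common variant; sklearn's smoothed formula differs slightly).
    """
    n = len(docs)
    df = Counter()
    tokenized = [doc.lower().split() for doc in docs]
    for toks in tokenized:
        df.update(set(toks))  # count each term once per document
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = ["the movie was great",
        "the movie was terrible",
        "great great soundtrack"]
w = tf_idf(docs)
# "the" appears in 2 of 3 docs, "terrible" in only 1,
# so "terrible" gets a higher weight than "the" in document 1.
assert w[1]["terrible"] > w[1]["the"]
```

A term common to many documents (here "the") is down-weighted by its idf factor, which is what makes tf-idf useful as a per-word weight on top of the embedding vectors.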
The above process illustrates how formula (1) is realized.
Further, for the different categories, different LSTM models are designed; the probability structure is obtained by training each LSTM model, and the initial comments IR_i of the different categories are generated (LSTM-based automatic comment text generation). The process is:
An LSTM-based text generation model with an encoder-decoder structure is built. Given a Twitter comment short text, the input text is first encoded. In the decoding stage, the probability distributions of candidate characters generated by the LSTM one moment at a time are used, and a reasonable sampling technique (e.g., greedy sampling, random sampling, or beam search) selects and determines the next character to appear; the resulting character sequence constitutes the natural language description of the input semantic items. Both the encoder side and the decoder side are composed of a single-layer LSTM. A context vector C is generated in the encoding stage; this vector serves as the input of the decoding stage, and the decoder side outputs the final sequence data. The above process illustrates how formula (2) is realized.
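The decoding-stage sampling strategies named above (greedy sampling and random sampling; beam search is omitted here for brevity) can be sketched over a toy per-step character distribution. The hard-coded STEP_PROBS table is a hypothetical stand-in for the softmax outputs of the decoder LSTM, not the patent's model.

```python
import random

# Toy stand-in for the decoder: a fixed probability distribution over a
# tiny character vocabulary at each time step (a real decoder would
# condition each step on the characters chosen so far).
STEP_PROBS = [
    {"a": 0.7, "b": 0.2, "<eos>": 0.1},
    {"a": 0.1, "b": 0.6, "<eos>": 0.3},
    {"a": 0.1, "b": 0.1, "<eos>": 0.8},
]

def greedy_decode(step_probs):
    """Pick the argmax character at every step, stopping at <eos>."""
    out = []
    for dist in step_probs:
        ch = max(dist, key=dist.get)
        if ch == "<eos>":
            break
        out.append(ch)
    return "".join(out)

def random_decode(step_probs, rng):
    """Sample each character from that step's distribution."""
    out = []
    for dist in step_probs:
        chars, probs = zip(*dist.items())
        ch = rng.choices(chars, weights=probs)[0]
        if ch == "<eos>":
            break
        out.append(ch)
    return "".join(out)

print(greedy_decode(STEP_PROBS))  # -> "ab"
```

Greedy decoding is deterministic and tends toward the single most likely string, while random sampling trades some likelihood for diversity, which matters when many varied comments must be generated from one model.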
The technical solutions corresponding to claims 4 and 5 below together illustrate how formula (3) is realized.
Further, according to the characteristics of each category, corresponding text-processing strategies are formulated to correct the corresponding LSTM model, thereby generating high-quality comment text FR_i consistent with real social network comments (text deviation correction based on domain knowledge). The process is:
For the topic (event) T_i itself, relatively comprehensive prior knowledge, i.e., domain knowledge, is collected. Comments unrelated or weakly related to the theme, and comments contradicting the facts, undergo deviation correction processing, which comprises three kinds of processing: text replacement, text paraphrasing, and model customization, collectively called the text deviation correction technique based on domain knowledge;
Further, the process of the text replacement algorithm is:
(1) Given a topic word C, choose the corresponding reference data set F by the topic word C;
(2) Within the reference data set F, find all the words related to the topic word C whose topic relevance exceeds the threshold, forming a set P; the set P is determined through the WordNet dictionary: candidates are extracted by choosing the hypernym/hyponym, member, and entailment relations of the topic word, and the final candidate set P is obtained by comparing the similarity sim between the topic word C and each word p in the candidates against the threshold k;
(3) Within the generated initial comment set IR, find the nouns similar to C as in the second step, forming a set Q; for a word in Q, score it by relevance and randomly replace it with a word from P;
The process of the text paraphrasing algorithm:
For any sentence s ∈ IR, part-of-speech judgment is performed on each word token in it to identify adjective, adverb, and verb tokens. If a word belongs to the synonym dictionary Syn, the token undergoes a translate-and-back-translate process; the resulting paraphrased token is compared by cosine similarity, and the word closest in distance to the original token is used as the replacement, yielding the paraphrased text FRpa produced by the text paraphrasing algorithm;
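The cosine-similarity comparison used to pick the back-translated word closest to the original token can be sketched as follows. The 3-dimensional vectors and the word list are hypothetical; a real system would use the trained word embeddings described earlier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings for an original token and two
# back-translated candidates; the closest candidate is kept.
VEC = {
    "happy":   [0.9, 0.1, 0.0],
    "glad":    [0.85, 0.15, 0.05],
    "content": [0.5, 0.5, 0.2],
}

def closest(original, candidates):
    """Return the candidate whose vector is nearest the original's."""
    return max(candidates, key=lambda c: cosine(VEC[original], VEC[c]))

print(closest("happy", ["glad", "content"]))  # -> "glad"
```

Selecting the nearest neighbor in embedding space is what keeps the paraphrase close in meaning to the token it replaces.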
The process of the model customization algorithm:
The type() function is used to extract the corresponding type of each text in the FR comment set; for the interrogative and exclamatory types in the comment set, the corresponding template type set T in the template library TR is extracted. By judging the template type t' of a sentence sent, the slots of template t' and the corresponding part-of-speech sequence are obtained and matched against template t; the translate function is used to exchange template slot positions, realizing the generation of the template-customized text sent'.
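The template-based customization step can be illustrated with a minimal slot-filling sketch. The template patterns, slot names, and the fill helper below are all hypothetical, standing in for the patent's template library TR and its type()/translate functions.

```python
import re

# Hypothetical template library keyed by sentence type; slots are
# marked with the expected part of speech, e.g. {NOUN}.
TEMPLATES = {
    "interrogative": "What do you think about {NOUN}?",
    "exclamatory": "How {ADJ} this {NOUN} is!",
}

def fill(sent_type, slots):
    """Instantiate a template of the given type with slot values."""
    template = TEMPLATES[sent_type]
    for pos, word in slots.items():
        template = template.replace("{" + pos + "}", word)
    if re.search(r"\{[A-Z]+\}", template):
        raise ValueError("unfilled slot remains")
    return template

print(fill("exclamatory", {"ADJ": "moving", "NOUN": "speech"}))
# -> "How moving this speech is!"
```

Raising on an unfilled slot mirrors the need to match a sentence's part-of-speech sequence against the template's slot sequence before a customized sentence can be emitted.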
The invention has the following beneficial effects:
Research on existing text generation is very mature, but research targeting the Twitter platform is still scarce. For the problems raised in the previous section, the present invention, oriented to the diversity, randomness, and casualness of Twitter comment texts, divides comment texts into different categories by language pattern and generates text of the corresponding category in a targeted way. Using the NLG technique based on LSTM learning, the appearance of sentence structure, the semantics and types of characters, and each individual character are encoded through the probabilistic relations between characters learned from data. The comment information to be expressed is fused at both the semantic and syntactic levels, and in a later stage, through methods such as specific-word replacement, vivid, fluent, and varied high-quality comment text nearly indistinguishable from real social network comments is generated.
The present invention mainly studies comment generation for specific domains of social networks and provides a favorable material corpus for public-opinion guidance. By spreading more truthful and trustworthy speech, it provides netizens with a reliable, positive, healthy, and upward mainstream public-opinion environment and restores a positive network environment. The invention can serve as a material corpus fed into existing public-opinion guidance systems, breaking down massive adverse speech in time. It is of great significance for purifying the Internet environment, safeguarding the public-opinion atmosphere of the country and the people, weakening hostile forces, building a harmonious society, and maintaining national security and comprehensive stable development.
Brief description of the drawings
Fig. 1 is the block diagram of the LSTM-based comment text generation model; Fig. 2 is a schematic diagram of a deviation correction example; Fig. 3 is the comparison chart of the experiment with and without classification; Fig. 4 compares the text F1 values of IR and FR;
Fig. 5 is the comparison chart of model training results on Twitter data versus Yelp data;
Fig. 6 is the variation chart of the repetition rate under each sample value.
Specific embodiments
The realization of the overall scheme of the present invention is illustrated with reference to the accompanying drawings as follows:
1. Since Twitter comment texts have diversity, randomness, and casualness, and given the characteristics of comments in the five fields of politics, health, education, entertainment, and science and technology, comment texts are classified by combining the structural characteristics of the language. For the different categories, different LSTM models are designed; through the learned probability structure, the appearance of word structure, the semantics and types of words, and each individual word are encoded. The comment information to be expressed is fused at both the semantic and syntactic levels, generating initial comments of the different categories. According to the characteristics of each category, corresponding text-processing strategies are formulated to correct the model, generating high-quality comment text nearly indistinguishable from real social network comments. Given a specific domain D under a social network W, the set of hot topics contained in the domain is T = {T_1, T_2, ..., T_n}. Select a topic T_i, and for topic T_i select a specific main post P. Crawl the comment text set under the main post P, denoted RR = {RR_1, RR_2, ..., RR_n}. Through classification, we filter out data of the different categories, denoted FlR = {FlR_1, FlR_2, ..., FlR_n}, and feed them separately into LSTM models with different parameters, generating the initial comment sets of the respective categories, denoted IR = {IR_1, IR_2, ..., IR_n}. For the features of each category, different strategies are formulated to correct the deviation of IR_i, and the final comment set is generated, denoted FR = {FR_1, FR_2, ..., FR_n}, as shown in formulas (1) to (3), where i ∈ {1, 2, 3, ..., n}; the function C represents the text classification process based on the random forest model, the function h represents the text generation process based on LSTM, and the function z_j represents the deviation correction process composed of text replacement, text paraphrasing, and model customization.
C(RR) → FlR_i (1)
h(W, D, T, P, FlR_i) → IR_i (2)
z_j(IR_i) → FR_i, j ∈ {1, 2, 3} (3)
2. Sentence-structure classification based on the random forest model
Users participate ever more in suddenly erupting social events, producing large-scale comment texts. The Twitter commentary style is complex and changeable with low recognizability; directly feeding a large amount of text into a machine learning model for text generation makes the style hard to learn, and the expected good results cannot be obtained. Therefore this patent first classifies comment texts from the angle of sentence structure to obtain single-style comments with highly recognizable speech. A data set is first created by a crawler and processed with word segmentation, part-of-speech tagging, and sentence segmentation. Since too many text features would cause a feature disaster and problems such as overfitting, feature extraction and selection are performed on the text to obtain the feature vector representing the text, which is input into the random forest model; the text classification output is obtained by random forest training, and the comment texts are finally divided into the following seven classes: subject-link-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-predicate-object structure, subject-predicate-object-complement structure, and imperative structure.
Feature extraction is the most critical step of text classification, and the usefulness of the extracted features directly affects the quality of the classification result. In the classification process, if the feature dimension is too high, phenomena such as dimension disaster, overfitting, and excessive noise data may occur. Therefore feature selection and feature extraction are needed for dimensionality reduction. The selected features are denoted by identifiers and are introduced one by one below.
(1) Word embedding + tf-idf feature
Each word is vectorized by selecting the word embedding vector obtained by CBOW model training. The objective function of CBOW is shown in formula (4). For a word w_t, its context is Context(w_t) = {w_{t-b}, ..., w_{t-1}, w_{t+1}, ..., w_{t+b}}, where the constant b determines the contextual window size. The model accuracy is positively correlated with the word window size b. In this study, the window size is b = 4 and the dimension of the final vectorized word is set to 120. Meanwhile, the tf-idf feature selection method is introduced to weigh each word embedding vector in the text, yielding the word embedding + tf-idf feature, named Wetfidf.
(2) WfreMatrix (word frequency matrix) feature
The WfreMatrix feature represents the frequency matrix: its rows indicate article numbers, its columns indicate all the words appearing in the articles, and each matrix element is the frequency with which a word occurs. Word frequency statistics are computed through the interface provided by sklearn.
(3) Pos (part of speech) feature
The sentence structure of English follows certain rules and has apparent templates; it is closely related to the part of speech of each word in the text, and the relative order and dependencies among words determine the tendency of the clause. Comments are rather colloquial, with short clauses, mostly simple sentences; statistics show 10 kinds of parts of speech in total. Part-of-speech tagging of the text is performed with the pos_tag_sents function in the NLTK toolkit.
(4) Key feature
The Key feature represents the keyword feature. A keyword is a word that in many cases can represent a category. For example, the comparative structure can be basically judged by the word "than"; the exclamatory structure can be well locked by the combination of How/What with an exclamation mark; and the linking verb in the subject-link-predicative structure is an apparent marker. Key is computed as formula (5).
(5) Index feature
The Index feature represents the word-position-sequence feature. For the imperative structure, most sentences begin with a verb, i.e., the position sequence of the verb is one, which is a good feature. For the subject-predicate-object structure, the relative order of subject, predicate, and object is an important recognition criterion.
(6) Punc feature
The Punc feature represents the punctuation feature. The types of punctuation differ widely among clause patterns: for the interrogative structure, the presence of a question mark is a very strong feature, while an exclamation mark indicates an exclamatory or imperative sentence. For the punctuation feature of a text, the punc feature is extracted by counting the position sequence of a fixed punctuation sequence.
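The Punc feature just described can be illustrated with a small sketch. The patent does not specify the exact encoding, so the layout below (1-based character positions for each mark in a fixed order, -1 when absent) is an assumption for illustration.

```python
# Fixed punctuation sequence whose position sequence forms the feature.
PUNCTUATION = ["?", "!", ",", "."]

def punc_feature(text):
    """Position sequence of each punctuation mark in a fixed order.

    For every mark in PUNCTUATION, record the 1-based character
    positions at which it occurs, or [-1] if it is absent.  This exact
    layout is assumed; the patent only says positions are counted.
    """
    feats = {}
    for mark in PUNCTUATION:
        positions = [i + 1 for i, ch in enumerate(text) if ch == mark]
        feats[mark] = positions or [-1]
    return feats

f = punc_feature("Is it true?")
print(f["?"])  # -> [11]  (the question mark is the 11th character)
```

For an interrogative comment the "?" entry is populated and the "!" entry is -1, which is exactly the kind of discriminative signal the classifier exploits.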
3. LSTM-based automatic comment text generation
An LSTM-based text generation model with an encoder-decoder structure is built; the basic structure is shown in Fig. 1. Given a Twitter comment short text, the input text is first encoded. In the decoding stage, the probability distributions of candidate characters generated by the LSTM one moment at a time are used, and a reasonable sampling technique (e.g., greedy sampling, random sampling, or beam search) selects and determines the next character to appear; the resulting character sequence constitutes the natural language description of the input semantic items. Both the encoder side and the decoder side are composed of a single-layer LSTM. A context vector C is generated in the encoding stage; this vector serves as the input of the decoding stage, and the decoder side outputs the final sequence data, as shown in Fig. 1.
4. Text deviation correction based on domain knowledge
To make the generated comments stick closely to the model theme, i.e., have higher topic relevance, and be true to life, relatively comprehensive prior knowledge, i.e., domain knowledge, is collected for the event itself. Comments unrelated or weakly related to the theme, and comments contradicting the facts, undergo deviation correction processing, including three kinds of processing: text replacement, text paraphrasing, and model customization, collectively called the text deviation correction technique based on domain knowledge. Taking noun text replacement as an example: under the hot topic of a certain country's leadership election, RR_1 ("It is a great book") is clearly a comment unrelated to the theme, so a text replacement operation is performed on it. As shown in Fig. 2, in the initial comment the theme-unrelated words to be replaced are marked green, the candidate replacement words from the original comment set on the Twitter platform are marked red, and the words after replacement are marked yellow.
5. Algorithm description
To give the finally generated comments higher topical relevance to the target hot-topic post, a noun-based text replacement method is proposed. The candidate set P is determined via WordNet: a candidate set P' is extracted by following the topic word's subordinate, member, and entailment relations, and the final candidate set P is obtained by comparing the similarity sim between the topic word C and each word p in P' against a threshold k. Then, inside the generated initial comment set R, nouns similar to C are found as in the second step, forming a set Q. Each word in Q is scored by its degree of relevance and randomly replaced with a word from P.
The noun-based text replacement algorithm is expressed as follows:
Algorithm 1: Text Replacement Method
Input: initial comment set IR, classified comment set FlRi, topic word C, similarity threshold MINsim
Output: final comment set FRnoun
Step 1: find the word set close to C in the classified comment set, forming set P
For t ∈ FlR
For n ∈ Nouns(t)
a) initialize the set
b) find all words related to the topic word C inside the reference data set F
c) filter out the words whose topic relevance exceeds the threshold, forming a set P
END For
END For
Step 2: find the word set close to C in the initial comment set, forming set Q
For n ∈ Nouns(R)
find the nouns similar to C inside the generated initial comment set R, forming a set Q
For p ∈ P do
score each word in Q by its degree of relevance and randomly replace it with a word from P
END For
END For
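The candidate-set construction in Step 1 can be sketched in a few lines. This is a self-contained illustration, not the patent's implementation: a toy relation table and a character-bigram Jaccard similarity stand in for WordNet and the sim() function, and all data are hypothetical:

```python
# Toy stand-in for the WordNet relations of a topic word (hypothetical data).
RELATIONS = {
    "election": {
        "hyponyms": ["general_election", "primary"],   # subordinate relation
        "member_holonyms": ["politics"],               # member relation
        "entailments": ["vote"],                       # implication relation
    }
}

def sim(a, b):
    """Character-bigram Jaccard similarity, a cheap stand-in for sim()."""
    bigrams = lambda w: {w[i:i + 2] for i in range(len(w) - 1)}
    sa, sb = bigrams(a), bigrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def candidate_set(topic_word, min_sim):
    """Build candidate set P' from the relations, then keep only words
    whose similarity to the topic word reaches the threshold (set P)."""
    rel = RELATIONS.get(topic_word, {})
    p_prime = {w for words in rel.values() for w in words}
    return {w for w in p_prime if sim(topic_word, w) >= min_sim}

print(candidate_set("election", 0.3))  # {'general_election'}
```

With a real WordNet backend (e.g., NLTK's wordnet corpus) the relation lookups would replace the RELATIONS table, while the thresholding step stays the same.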
Paraphrasing means expressing the same meaning in multiple ways. In text generation research, paraphrasing can be applied to the automatic rewriting of sentences generated by the LSTM model, helping to produce smoother and more vivid text. In the "word choice" step in particular, a given meaning can be expressed with flexibly chosen vocabulary depending on the context, enriching the finally generated corpus. The text paraphrasing algorithm is expressed as follows:
Algorithm 2: Text Paraphrases Method
Input: initial comment set IR, adjectives ADJ, adverbs ADV
Output: paraphrased text FRpa
For each token e in sentence s
(1) judge the part of speech of each word token, keeping the adjective, adverb, and verb tokens found
(2) if the word belongs to the synonym dictionary Syn, apply the translate-and-retranslate process to the token
(3) obtain the token's paraphrased word
(4) compute cosine similarity and replace the original token with the nearest word
END For
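The word-choice replacement in steps (2) to (4) can be sketched as follows. The synonym dictionary and the 2-d word vectors below are hypothetical stand-ins for the real Syn dictionary and trained embeddings:

```python
import math

# Hypothetical synonym dictionary and word vectors, for illustration only.
SYNONYMS = {"good": ["great", "fine"], "quickly": ["rapidly", "fast"]}
VECTORS = {
    "good": [1.0, 0.2], "great": [0.9, 0.3], "fine": [0.4, 0.9],
    "quickly": [0.1, 1.0], "rapidly": [0.2, 0.9], "fast": [1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def paraphrase(tokens):
    """Replace each token found in the synonym dictionary with its
    cosine-nearest synonym, mimicking the word-choice replacement step."""
    out = []
    for tok in tokens:
        if tok in SYNONYMS and tok in VECTORS:
            best = max(SYNONYMS[tok], key=lambda s: cosine(VECTORS[tok], VECTORS[s]))
            out.append(best)
        else:
            out.append(tok)
    return out

print(paraphrase(["a", "good", "meal"]))  # ['a', 'great', 'meal']
```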
A "template" is an abstract expression generalized from natural-language phrases and sentences. Because a template is more representative than its corresponding instances, templates are widely used in natural language generation research. A template consists of two parts: pattern words and pattern slots; the pattern words can be regarded as the constant part of the template, while the pattern slots are its variable part. Fixed templates are statistically induced from a large collected corpus of syntactic patterns, and the matching degree between the input item and the templates determines which instances are generated. The same template can be instantiated into many examples, enriching the corpus. The template-based customization algorithm is expressed as follows:
Algorithm 3: Template-based Text Customization Method
Input: classified comment set FR, template library TR
Output: comment set FRim with customized text added
For sent in FR
(1) use the type() function to extract the type of each text in the comment set FR
(2) extract the corresponding template type set T from the template library TR
(3) judge the template type t' of sentence sent
(4) obtain the template slots of t' and the corresponding part-of-speech sequence
(5) match against template t and use the translate function to fill the template slot positions
END For
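Filling a template's pattern slots with part-of-speech-matched words can be sketched like this; the template strings and slot names are illustrative, not taken from the patent's template library TR:

```python
import re

# Hypothetical template library: fixed text is the pattern words,
# and {POS} placeholders are the pattern slots.
TEMPLATES = {
    "exclamatory": "What a {ADJ} {NOUN} this is!",
    "interrogative": "Is this really a {ADJ} {NOUN}?",
}

def instantiate(template_type, slot_values):
    """Fill each pattern slot with a word of the matching part of speech."""
    template = TEMPLATES[template_type]
    return re.sub(r"\{(\w+)\}", lambda m: slot_values[m.group(1)], template)

print(instantiate("exclamatory", {"ADJ": "close", "NOUN": "election"}))
# What a close election this is!
```

One template instantiated with different slot values yields many distinct comments, which is the corpus-enriching property described above.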
The technical effect of the invention is verified as follows:
The model is evaluated for quality using precision and recall. In this setting, precision refers to the ratio of detected machine-generated comments to all detected comments, measuring the exactness of the experimental result. Recall is the ratio of detected machine-generated comments to all machine-generated comments, measuring the completeness of the experimental result. Precision assesses how many of the comments flagged by the detection algorithm are indeed machine-generated fakes; recall assesses how many of all machine-generated fake comments were retrieved. The F1 value is the harmonic mean of precision and recall; lower precision and recall mean that the machine-generated comments are harder to detect. Therefore, to assess the realism of the generated fake comments, lower precision, recall, and F1 values are better.
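Under these definitions the three scores can be computed directly; a minimal sketch (the comment-id sets below are hypothetical):

```python
def detection_scores(detected, machine_generated):
    """Precision, recall, and F1 of a fake-comment detector.
    detected: set of comment ids the detector flagged;
    machine_generated: set of ids that really are machine-generated."""
    tp = len(detected & machine_generated)  # true positives
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(machine_generated) if machine_generated else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = detection_scores({1, 2, 3, 4}, {2, 3, 5, 6})
print(p, r, f1)  # 0.5 0.5 0.5
```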
Unlike previous text generation research, the present invention runs a comparison experiment on English sentence structure before text generation, in order to prove the validity of adding the classification step. With all other variables held constant, two modes are set up: in model one, no text classification is performed before the comment generation experiment; in model two, text classification is performed first. Figure 3 shows the F1 values under both modes. The results show that the F1 value of the model with the text classification step is significantly lower than that of the model without prior classification, demonstrating that the classification step is indispensable. In each domain, the IR without text replacement is also compared with the FR after replacement; as shown in Fig. 4, the F1 value drops sharply after text processing, showing that text processing greatly improves the quality of the generated text.
The experiments use a public data set (the restaurant comment set from the Yelp website) together with the data of this research (comments from the Twitter platform) for comparison. Figure 5 shows the detection precision when the model learns the comments of the Yelp platform and of the Twitter platform separately. The results show that the two plots are nearly indistinguishable and that the precision is low in both cases, further indicating that the model has strong cross-platform adaptability.
Many comments on websites are simply copied hundreds or thousands of times to steer public opinion, or a new comment is formed by modifying a partially copied one. Such comments are recognizable at a glance as coordinated, deliberately orchestrated activity meant to drive public opinion, and massively duplicated comments are easily classified as untrustworthy. Therefore, using the K-gram-based Winnowing duplicate detection technique, the FR of this patent is compared with a portion of the real Twitter comments in the database, running duplicate detection between the two sets and the real comments in the database; the repetition rates obtained under different sample rates are shown in Fig. 6. Real comments, involving no copying, remain stable at about 0.08 and do not fluctuate as the sample grows or shrinks. The FR of this research rises with increasing sample rate, its repetition rate declines at a sample rate of 0.5, and it falls below 0.08 at a sample rate of 0.8; thus the repetition rate of the FR of this research is lower than that of real comments.
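K-gram Winnowing fingerprints a text by hashing all of its k-grams and keeping the minimum hash in each sliding window; the overlap between two fingerprint sets then approximates the repetition rate. A minimal sketch (the parameters k and window are illustrative, not the patent's settings):

```python
import zlib

def winnow_fingerprints(text, k=5, window=4):
    """K-gram winnowing: hash every k-gram, then keep the minimum
    hash within each sliding window as the document fingerprint."""
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    hashes = [zlib.crc32(g.encode()) for g in grams]
    if not hashes:
        return set()
    fingerprints = set()
    for i in range(max(len(hashes) - window + 1, 1)):
        fingerprints.add(min(hashes[i:i + window]))
    return fingerprints

def overlap_ratio(a, b):
    """Share of a's fingerprints also present in b (a crude repetition rate)."""
    fa, fb = winnow_fingerprints(a), winnow_fingerprints(b)
    return len(fa & fb) / len(fa) if fa else 0.0

print(overlap_ratio("the food was great", "the food was great"))  # 1.0
```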

Claims (5)

1. A social network comment generation method based on LSTM, characterized in that the method comprises:
classifying comment texts into seven categories: subject-linking-verb-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-verb-object structure, subject-verb-object-complement structure, and imperative structure; designing a different LSTM model for each category, obtaining a probability structure through the training of each LSTM model, and generating initial comments IRi of different categories, where the subscript i denotes the category, i = 1, 2, 3, ..., 7;
formulating, according to the characteristics of each category, a corresponding text-processing strategy to correct the corresponding LSTM model, thereby generating high-quality comment texts FRi consistent with real social network comments;
given a specific domain D under a social network W, whose hot topic set is T = {T1, T2, ..., Tn}, selecting a certain topic Ti; for topic Ti, selecting a specific main post P and crawling the comment text set under the main post P, denoted RR = {RR1, RR2, ..., RRn}; through classification, filtering out data of different categories, denoted FlR = {FlR1, FlR2, ..., FlRn}, and feeding them separately into LSTM models with different parameters to generate initial comment sets of the respective categories, denoted IR = {IR1, IR2, ..., IRn}; formulating, for the characteristics of each category, a different drift-correction strategy for each IRi, and generating the final comment set, denoted FR = {FR1, FR2, ..., FRn}, as shown in formulas (1) to (3), where i ∈ {1, 2, 3, ..., n} and n equals 7; the function C represents the text classification process based on the random forest model, the function h represents the LSTM-based text generation process, and the function zj represents the drift-correction process; drift correction comprises three strategies: text replacement, text paraphrasing, and template-based customization;
C(RR)→FlRi (1)
h(W,D,T,P,FlRi)→IRi (2)
zj(IRi)→FRi j∈{1,2,3} (3)。
2. The LSTM-based social network comment generation method according to claim 1, characterized in that
the process of classifying comment texts based on the random forest model is:
first, a data set is created by a crawler and subjected to word segmentation, part-of-speech tagging, and sentence splitting;
next, feature extraction is performed on the texts to obtain feature vectors representing them, which are input into the random forest model; the text classification output is obtained through random forest training, and the comment texts are finally divided into the following seven classes: subject-linking-verb-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-verb-object structure, subject-verb-object-complement structure, and imperative structure;
in the above feature extraction, feature selection and feature extraction are used for feature dimensionality reduction, with the selected features denoted by identifiers; the specific features extracted are as follows:
(1) Word embedding+tf-idf feature
each word is vectorized using the word embedding obtained by CBOW model training; the objective function of the CBOW model is shown in formula (4); for a word wt, the word context is Context(wt) = {wt-b, ..., wt-1, wt+1, ..., wt+b}, where the constant b determines the context window size, here b = 4; the dimension of the final vectorized word is set to 120; meanwhile, the tf-idf feature selection method is introduced to weight each word embedding vector, yielding the word embedding + tf-idf feature, named Wetfidf;
(2) WfreMatrix (word frequency matrix) feature
the WfreMatrix feature represents a word frequency matrix whose rows correspond to the articles and whose columns correspond to all words appearing in the comment texts; each matrix element is a word's frequency of occurrence; the word frequency statistics are computed through the interface provided by sklearn;
(3) Pos (Part of speech) feature
part-of-speech tagging is performed on the comment texts using the pos_tag_sents function in the NLTK toolkit;
(4) Key feature
the Key feature represents the keyword feature; a keyword is a word that can represent a category; Key is calculated as in formula (5);
(5) Index feature
the Index feature represents the word position sequence feature;
(6) Punc feature
the Punc feature represents the punctuation feature, extracted by counting the position sequences of a fixed punctuation sequence.
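As an illustration of the Wetfidf feature in (1), a tf-idf-weighted combination of word embeddings can be sketched as follows. The two toy documents and the length-based "embeddings" are hypothetical stand-ins for real comments and the 120-dimensional CBOW vectors:

```python
import math
from collections import Counter

docs = [["the", "election", "was", "close"],
        ["great", "food", "at", "the", "restaurant"]]

# Toy embeddings: one short vector per word (stand-in for CBOW output).
EMB = {w: [float(len(w)), 1.0] for d in docs for w in d}

def tfidf(word, doc, all_docs):
    """Term frequency times inverse document frequency."""
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in all_docs if word in d)
    return tf * math.log(len(all_docs) / df)

def doc_vector(doc, all_docs):
    """Wetfidf-style feature: tf-idf-weighted average of word embeddings."""
    dim = len(next(iter(EMB.values())))
    vec = [0.0] * dim
    for w in doc:
        weight = tfidf(w, doc, all_docs)
        for j in range(dim):
            vec[j] += weight * EMB[w][j]
    return [v / len(doc) for v in vec]

print(doc_vector(docs[0], docs))
```

Words that appear in every document (here "the") get idf 0 and thus contribute nothing, which is exactly the weighting effect the tf-idf step adds on top of the raw embeddings.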
3. The LSTM-based social network comment generation method according to claim 2, characterized in that different LSTM models are designed for the different categories, a probability structure is obtained through the training of each LSTM model, and initial comments IRi of different categories are generated, by the following process:
an LSTM-based text generation model with an encoder-decoder structure is built; given a short Twitter comment text, the input text is first encoded; in the decoding stage, the LSTM produces, step by step, a probability distribution over candidate characters at each time step, and a suitable sampling technique selects and determines the next character; the resulting character sequence constitutes a natural-language description of the input semantics; the encoder and the decoder each consist of a single-layer LSTM; the encoding stage produces a context vector C, which serves as the input to the decoding stage, and the decoder outputs the final sequence data.
4. The LSTM-based social network comment generation method according to claim 3, characterized in that, according to the characteristics of each category, corresponding text-processing strategies are formulated to correct the corresponding LSTM model, thereby generating high-quality comment texts FRi consistent with real social network comments, by the following process:
for a topic Ti itself, relatively comprehensive prior knowledge, i.e., domain knowledge, is collected; comments that are unrelated or weakly related to the topic, and comments that contradict the facts, undergo drift-correction processing, which comprises text replacement, text paraphrasing, and template-based customization, collectively referred to as domain-knowledge-based text drift correction.
5. The LSTM-based social network comment generation method according to claim 4, characterized in that
the text replacement algorithm proceeds as follows:
(1) a topic word C is given, and a corresponding reference data set F is chosen according to C;
(2) all words in the reference data set F that are related to the topic word C and whose topic relevance exceeds a threshold are found, forming a set P; the set P is determined via the WordNet dictionary: a candidate set is extracted by following the topic word's subordinate, member, and entailment relations, and the final candidate set P is obtained by comparing the similarity sim between the topic word C and each candidate word p against the threshold k;
(3) inside the generated initial comment set IR, nouns similar to C are found as in the second step, forming a set Q; each word in Q is scored by its degree of relevance and randomly replaced with a word from P;
the text paraphrasing algorithm proceeds as follows:
for any sentence s ∈ IR, the part of speech of each word token is judged; for the adjective, adverb, and verb tokens found, if a word belongs to the synonym dictionary Syn, the token undergoes a translate-and-retranslate process to obtain its paraphrased word; cosine similarity is then computed and the original token is replaced with the nearest word, yielding the paraphrased text FRpa of the text paraphrasing algorithm;
the template-based customization algorithm proceeds as follows:
the type() function extracts the type of each text in the comment set FR; for the interrogative and exclamatory types in the comment set, the corresponding template type set T is extracted from the template library TR; by judging the template type t' of a sentence sent, the template slots of t' and the corresponding part-of-speech sequence are obtained and matched against template t, and the translate function is used to fill the template slot positions, realizing the generation of the template-based customized text sent'.
CN201910680645.7A 2019-07-25 2019-07-25 A kind of social networks comment generation method based on LSTM Pending CN110390018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910680645.7A CN110390018A (en) 2019-07-25 2019-07-25 A kind of social networks comment generation method based on LSTM


Publications (1)

Publication Number Publication Date
CN110390018A true CN110390018A (en) 2019-10-29

Family

ID=68287434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910680645.7A Pending CN110390018A (en) 2019-07-25 2019-07-25 A kind of social networks comment generation method based on LSTM

Country Status (1)

Country Link
CN (1) CN110390018A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160180838A1 (en) * 2014-12-22 2016-06-23 Google Inc. User specified keyword spotting using long short term memory neural network feature extractor
CN107491531A (en) * 2017-08-18 2017-12-19 华南师范大学 Chinese network comment sensibility classification method based on integrated study framework


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YU TAI et al.: "Automatic Generation of Review Content in Specific domain of social network based on RNN", IEEE *
Zhang Wenyu, Li Dong: "Intelligent Technology of the Internet of Things", 31 December 2012 *
Lan Xiang: "Research on Paraphrase Generation Technology Using Statistical Machine Translation Models", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078888A (en) * 2019-12-20 2020-04-28 电子科技大学 Method for automatically classifying comment data of social network users
CN111078888B (en) * 2019-12-20 2021-12-10 电子科技大学 Method for automatically classifying comment data of social network users
CN111126063A (en) * 2019-12-26 2020-05-08 北京百度网讯科技有限公司 Text quality evaluation method and device
CN111126063B (en) * 2019-12-26 2023-06-20 北京百度网讯科技有限公司 Text quality assessment method and device
CN111221940A (en) * 2020-01-03 2020-06-02 京东数字科技控股有限公司 Text generation method and device, electronic equipment and storage medium
CN111541910A (en) * 2020-04-21 2020-08-14 华中科技大学 Video barrage comment automatic generation method and system based on deep learning
CN111541910B (en) * 2020-04-21 2021-04-20 华中科技大学 Video barrage comment automatic generation method and system based on deep learning
CN113705227B (en) * 2020-05-21 2023-04-25 中国科学院上海高等研究院 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model
CN113705227A (en) * 2020-05-21 2021-11-26 中国科学院上海高等研究院 Method, system, medium and device for constructing Chinese non-segmented word and word embedding model
CN114429403A (en) * 2020-10-14 2022-05-03 国际商业机器公司 Mediating between social network and payment curation content producers in false positive content mitigation
CN113033179A (en) * 2021-03-24 2021-06-25 北京百度网讯科技有限公司 Knowledge acquisition method and device, electronic equipment and readable storage medium
CN113033179B (en) * 2021-03-24 2024-05-24 北京百度网讯科技有限公司 Knowledge acquisition method, knowledge acquisition device, electronic equipment and readable storage medium
CN113743086A (en) * 2021-08-31 2021-12-03 北京阅神智能科技有限公司 Chinese sentence evaluation output method
CN114443809A (en) * 2021-12-20 2022-05-06 西安理工大学 Hierarchical text classification method based on LSTM and social network
CN114443809B (en) * 2021-12-20 2024-04-09 西安理工大学 Hierarchical text classification method based on LSTM and social network
CN114510924A (en) * 2022-02-14 2022-05-17 哈尔滨工业大学 Text generation method based on pre-training language model
CN114510649A (en) * 2022-02-25 2022-05-17 西安理工大学 Social network and LSTM model accuracy rate calculation method based on de-duplication sample
CN114510649B (en) * 2022-02-25 2024-04-09 西安理工大学 Social network and LSTM model accuracy calculating method based on deduplication sample
CN114707489A (en) * 2022-03-29 2022-07-05 马上消费金融股份有限公司 Method and device for acquiring marked data set, electronic equipment and storage medium
CN114707489B (en) * 2022-03-29 2023-08-18 马上消费金融股份有限公司 Method and device for acquiring annotation data set, electronic equipment and storage medium
CN117807963A (en) * 2024-03-01 2024-04-02 之江实验室 Text generation method and device in appointed field
CN117807963B (en) * 2024-03-01 2024-04-30 之江实验室 Text generation method and device in appointed field

Similar Documents

Publication Publication Date Title
CN110390018A (en) A kind of social networks comment generation method based on LSTM
Gokulakrishnan et al. Opinion mining and sentiment analysis on a twitter data stream
Cappallo et al. New modality: Emoji challenges in prediction, anticipation, and retrieval
CN109829166B (en) People and host customer opinion mining method based on character-level convolutional neural network
Aragón et al. Overview of MEX-A3T at IberLEF 2020: Fake News and Aggressiveness Analysis in Mexican Spanish.
Barsever et al. Building a better lie detector with BERT: The difference between truth and lies
CN112905739B (en) False comment detection model training method, detection method and electronic equipment
CN114936266A (en) Multi-modal fusion rumor early detection method and system based on gating mechanism
CN105869058B (en) A kind of method that multilayer latent variable model user portrait extracts
CN112966117A (en) Entity linking method
Maynard et al. Multimodal sentiment analysis of social media
Yu et al. BCMF: A bidirectional cross-modal fusion model for fake news detection
Pham Transferring, transforming, ensembling: the novel formula of identifying fake news
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
Scola et al. Sarcasm detection with BERT
Bölücü et al. Hate Speech and Offensive Content Identification with Graph Convolutional Networks.
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN114372454A (en) Text information extraction method, model training method, device and storage medium
Kavatagi et al. A context aware embedding for the detection of hate speech in social media networks
Hamed et al. DISINFORMATION DETECTION ABOUT ISLAMIC ISSUES ON SOCIAL MEDIA USING DEEP LEARNING TECHNIQUES
Wang et al. Using ALBERT and Multi-modal Circulant Fusion for Fake News Detection
Li et al. Multilingual toxic text classification model based on deep learning
Lan et al. Mining semantic variation in time series for rumor detection via recurrent neural networks
Upadhyaya et al. Food Items Prediction Using Sentimental Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination