CN110390018A - A social network comment generation method based on LSTM - Google Patents

A social network comment generation method based on LSTM

Info

Publication number
CN110390018A
CN110390018A (application CN201910680645.7A)
Authority
CN
China
Prior art keywords
text
word
comment
feature
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910680645.7A
Other languages
Chinese (zh)
Inventor
何慧
张伟哲
方滨兴
邰煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN201910680645.7A
Publication of CN110390018A
Legal status: Pending


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Databases & Information Systems (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

A social network comment generation method based on LSTM, belonging to the technical field of social network comment generation. The invention addresses the problems that the application scenarios of existing social network comment generation techniques are overly narrow and uniform, and that they cannot provide a material corpus for public-opinion guidance. The invention uses an NLG technique based on LSTM learning, encoding the appearance of sentence structure, the semantics and types of characters, and each individual character through the probabilistic relations between characters learned from data. The comment information to be expressed is fused at both the semantic and syntactic levels, and in a later stage, through methods such as specific-word replacement, vivid, fluent, and varied high-quality comment text nearly indistinguishable from real social network comments is generated. The invention provides a favorable material corpus for public-opinion guidance: by spreading more truthful and trustworthy speech, it restores a positive network environment. The invention can serve as a material corpus fed into existing public-opinion guidance systems to generate comments for specific domains of a social network.

Description

A social network comment generation method based on LSTM
Technical field
The present invention relates to a social network comment generation method based on LSTM, and belongs to the technical field of social network comment generation.
Background technique
Nowadays, online social network platforms have greatly facilitated netizens' lives and communication. People and events all over the world are closely connected through the network, user participation in network events is ever higher, and numerous social network comments are produced as a result. A comment represents a language and a voice and is a reflection of thought; its sentences are concise, its intent clear, and its structures varied, making it an ideal testbed for automatic text generation. By posting, a user can express his or her own view and position on a hot event, whether approval, neutrality, or disapproval, or spread a network rumor driven by interests. This patent focuses on the Twitter platform, collecting user-generated content (UGC) in five fields on that platform, namely politics, health, education, entertainment, and science and technology, and mainly completes the automatic generation of comment text on Twitter.
Automatic text generation is one of the research fields of artificial intelligence. Its main idea is to analyze the deep meaning of the information fed into the computer, plan the information to be generated with the computer's internal text planner, and then, through a text realizer, convert this meaning into a grammatical linguistic structure that is output in the form of comment text. Artificial intelligence is a current hot spot of scientific development; text generation technology has gradually attracted attention, and its applications in real life are extremely broad, strongly influencing human work and life. The existing document CN108256968A discloses an expert-comment generation method for commodities on e-commerce platforms. It proposes an expert-comment summarization and generation technique based on a sequence-to-sequence model, which extracts the important information from all user comments on a commodity and generates a summary passage describing the commodity's characteristics. Consumers can learn the pros and cons of a commodity from the generated expert comment and decide whether to buy; merchants can improve their commodities according to the generated expert comment. That method can extract important comments representative of product characteristics, provide merchants with good references for improving commodities, help merchants enhance the user experience and increase sales and revenue, provide purchase references for consumers and improve their shopping experience, and help e-commerce platforms attract more loyal users and expand their influence. That document, however, does not propose generating comments by deep learning.
Judging from the overall development of natural language text generation at home and abroad, existing natural language generation techniques have the following problems with respect to comment generation in specific domains of social networks.
(1) In natural language generation research there are already many fairly mature models. Natural language text generation mostly focuses on dialogue systems, machine translation, information retrieval, text classification, and automatic summarization; the target texts studied are mostly standardized text collections, published articles, or publicly available standard data sets. Research on social network comment text generation is rare.
(2) Network comments are produced in many ways: social networks (e.g., Facebook, Twitter, Weibo, RenRen), e-commerce (e.g., Amazon, Alibaba, Dangdang), mail services (e.g., Gmail, Yahoo, E-mail), and network forums (e.g., Tianya, NetEase, Douban). Current work mostly concerns comment text generation for e-commerce and mail service platforms, and no universal method exists for generating social network comment text.
(3) In existing models for comment text generation, the studied language patterns are fixed and uniform. Comments on the Twitter platform target hot events in different fields; events are sudden, thoughts are posted as they occur, mostly led by first impressions, so comments are highly random and diverse, and their patterns are hard to capture. Although comment generation on the Yelp social network site has achieved rich results, it only involves users' commenting patterns on public reviews, which follow a fairly standard structure: experience-led, without preamble, sticking to the theme, commenting on key points, mostly led by likes and dislikes, with fixed and uniform patterns. The application scenarios are on the whole overly stable, the rules are easy to capture, and such models are not suitable for comments on the Twitter platform.
Summary of the invention
The technical problem to be solved by the present invention:
The present invention addresses the problems that the application scenarios of existing social network comment generation techniques are overly narrow and uniform and cannot provide a material corpus for public-opinion guidance, and therefore proposes a social network comment generation method based on LSTM.
The technical solution adopted by the present invention to solve the above technical problem is:
A social network comment generation method based on LSTM, the method comprising:
Classifying comment texts into seven categories: subject-link-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-predicate-object structure, subject-predicate-object-complement structure, and imperative structure. For the different categories, different LSTM models are designed; a probability structure is obtained by training each LSTM model, and initial comments IR_i of the different categories are generated, where the subscript i indicates the seven categories, i = 1, 2, 3, ..., 7;
According to the characteristics of each category, corresponding text-processing strategies are formulated to correct the corresponding LSTM model, thereby generating high-quality comment text FR_i consistent with real social network comments;
Given a specific domain D under a social network W, the set of hot topics contained in the domain is T = {T_1, T_2, ..., T_n}. Select a topic T_i, and for topic T_i select a specific main post P. Crawl the comment text set under the main post P, denoted RR = {RR_1, RR_2, ..., RR_n}. Through classification, data of the different categories are filtered out, denoted FlR = {FlR_1, FlR_2, ..., FlR_n}, and are fed separately into LSTM models with different parameters, generating the initial comment sets of the respective categories, denoted IR = {IR_1, IR_2, ..., IR_n}. For the features of each category, different strategies are formulated to correct the deviation of IR_i, generating the final comment set, denoted FR = {FR_1, FR_2, ..., FR_n}, as shown in formulas (1) to (3), where i ∈ {1, 2, 3, ..., n} and n equals 7; the function C represents the text classification process based on the random forest model, the function h represents the text generation process based on LSTM, and the function z_j represents the deviation correction process. Deviation correction comprises three strategies: text replacement, text paraphrasing, and model customization;
C(RR) → FlR_i (1)
h(W, D, T, P, FlR_i) → IR_i (2)
z_j(IR_i) → FR_i, j ∈ {1, 2, 3} (3)
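Under stated assumptions, the composition of formulas (1) to (3) can be sketched as three chained functions. The functions below are illustrative stand-ins only: in the patent, C is a random forest classifier, h is an LSTM encoder-decoder, and z_j is one of three deviation-correction strategies; the crude punctuation-based bucketing and string tagging here are hypothetical simplifications.

```python
# Sketch of the C -> h -> z_j pipeline from formulas (1)-(3).
# All three stages are illustrative stubs, not the patent's models.

def classify(raw_comments):
    """C(RR) -> FlR_i: bucket raw comments by a crude structural cue."""
    buckets = {"interrogative": [], "exclamatory": [], "other": []}
    for c in raw_comments:
        if c.strip().endswith("?"):
            buckets["interrogative"].append(c)
        elif c.strip().endswith("!"):
            buckets["exclamatory"].append(c)
        else:
            buckets["other"].append(c)
    return buckets

def generate(filtered):
    """h(W, D, T, P, FlR_i) -> IR_i: stand-in for the per-category LSTM."""
    return {cat: [f"[generated:{cat}] {c}" for c in comments]
            for cat, comments in filtered.items()}

def correct(initial):
    """z_j(IR_i) -> FR_i: stand-in for deviation correction."""
    return {cat: [c.replace("[generated:", "[final:") for c in comments]
            for cat, comments in initial.items()}

raw = ["Is this true?", "What a day!", "I agree with this."]
final = correct(generate(classify(raw)))
print(final["interrogative"])  # -> ['[final:interrogative] Is this true?']
```

The point of the sketch is only the data flow: raw comments are partitioned by category, each partition feeds its own generator, and every generated comment then passes through a correction stage.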
Further, the process of classifying comment texts based on the random forest model (i.e., sentence-structure classification based on the random forest model) is:
First, a data set is created by a crawler and processed with word segmentation, part-of-speech tagging, and sentence segmentation;
Second, feature extraction and selection are performed on the text to obtain the feature vector representing the text, which is input into the random forest model; the text classification output is obtained by random forest training, and the comment texts are finally divided into the following seven classes: subject-link-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-predicate-object structure, subject-predicate-object-complement structure, and imperative structure;
In the above feature extraction, feature selection and feature extraction are used for dimensionality reduction; the selected features are denoted by identifiers, and the specific features extracted are as follows:
(1) Word embedding + tf-idf feature
Each word is vectorized by selecting the word embedding vector obtained by CBOW model training. The objective function of CBOW is shown in formula (4). For a word w_t, its context is Context(w_t) = {w_{t-b}, ..., w_{t-1}, w_{t+1}, ..., w_{t+b}}, where the constant b determines the contextual window size; the window size is b = 4, and the dimension of the final vectorized word is set to 120. Meanwhile, the tf-idf feature selection method is introduced to weigh each word embedding vector, yielding the word embedding + tf-idf feature, named Wetfidf;
(2) WfreMatrix (word frequency matrix) feature
The WfreMatrix feature represents the frequency matrix: rows indicate article numbers, columns indicate the words appearing in the comment texts, and each matrix element is the frequency with which a word occurs; word frequency statistics are computed through the interface provided by sklearn;
(3) Pos (part of speech) feature
Part-of-speech tagging of the comment texts is performed with the pos_tag_sents function in the NLTK toolkit;
(4) Key feature
The Key feature represents the keyword feature; a keyword is a word that can represent a category; Key is computed as formula (5);
(5) Index feature
The Index feature represents the word-position-sequence feature;
(6) Punc feature
The Punc feature represents the punctuation feature; it is extracted by counting the position sequence of a fixed punctuation sequence.
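The tf-idf weighting used in feature (1) can be illustrated with a small self-contained computation. The patent combines CBOW embeddings with tf-idf via sklearn; the sketch below instead computes the tf-idf term weights alone, from scratch, on toy documents, using one common (assumed) variant of the formula.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute tf-idf weights per document.

    tf  = count of term in doc / doc length
    idf = log(N / df), df = number of docs containing the term
    (a common variant; sklearn's smoothed formula differs slightly).
    """
    n = len(docs)
    df = Counter()
    tokenized = [doc.lower().split() for doc in docs]
    for toks in tokenized:
        df.update(set(toks))  # count each term once per document
    weights = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        weights.append({t: (c / total) * math.log(n / df[t])
                        for t, c in tf.items()})
    return weights

docs = ["the movie was great",
        "the movie was terrible",
        "great great soundtrack"]
w = tf_idf(docs)
# "the" appears in 2 of 3 docs, "terrible" in only 1,
# so "terrible" gets a higher weight than "the" in document 1.
assert w[1]["terrible"] > w[1]["the"]
```

A term common to many documents (here "the") is down-weighted by its idf factor, which is what makes tf-idf useful as a per-word weight on top of the embedding vectors.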
The above process illustrates how formula (1) is realized.
Further, for the different categories, different LSTM models are designed; the probability structure is obtained by training each LSTM model, and the initial comments IR_i of the different categories are generated (LSTM-based automatic comment text generation). The process is:
An LSTM-based text generation model with an encoder-decoder structure is built. Given a Twitter comment short text, the input text is first encoded. In the decoding stage, the probability distributions of candidate characters generated by the LSTM one moment at a time are used, and a reasonable sampling technique (e.g., greedy sampling, random sampling, or beam search) selects and determines the next character to appear; the resulting character sequence constitutes the natural language description of the input semantic items. Both the encoder side and the decoder side are composed of a single-layer LSTM. A context vector C is generated in the encoding stage; this vector serves as the input of the decoding stage, and the decoder side outputs the final sequence data. The above process illustrates how formula (2) is realized.
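The decoding-stage sampling strategies named above (greedy sampling and random sampling; beam search is omitted here for brevity) can be sketched over a toy per-step character distribution. The hard-coded STEP_PROBS table is a hypothetical stand-in for the softmax outputs of the decoder LSTM, not the patent's model.

```python
import random

# Toy stand-in for the decoder: a fixed probability distribution over a
# tiny character vocabulary at each time step (a real decoder would
# condition each step on the characters chosen so far).
STEP_PROBS = [
    {"a": 0.7, "b": 0.2, "<eos>": 0.1},
    {"a": 0.1, "b": 0.6, "<eos>": 0.3},
    {"a": 0.1, "b": 0.1, "<eos>": 0.8},
]

def greedy_decode(step_probs):
    """Pick the argmax character at every step, stopping at <eos>."""
    out = []
    for dist in step_probs:
        ch = max(dist, key=dist.get)
        if ch == "<eos>":
            break
        out.append(ch)
    return "".join(out)

def random_decode(step_probs, rng):
    """Sample each character from that step's distribution."""
    out = []
    for dist in step_probs:
        chars, probs = zip(*dist.items())
        ch = rng.choices(chars, weights=probs)[0]
        if ch == "<eos>":
            break
        out.append(ch)
    return "".join(out)

print(greedy_decode(STEP_PROBS))  # -> "ab"
```

Greedy decoding is deterministic and tends toward the single most likely string, while random sampling trades some likelihood for diversity, which matters when many varied comments must be generated from one model.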
The technical solutions corresponding to claims 4 and 5 below together illustrate how formula (3) is realized.
Further, according to the characteristics of each category, corresponding text-processing strategies are formulated to correct the corresponding LSTM model, thereby generating high-quality comment text FR_i consistent with real social network comments (text deviation correction based on domain knowledge). The process is:
For the topic (event) T_i itself, relatively comprehensive prior knowledge, i.e., domain knowledge, is collected. Comments unrelated or weakly related to the theme, and comments contradicting the facts, undergo deviation correction processing, which comprises three kinds of processing: text replacement, text paraphrasing, and model customization, collectively called the text deviation correction technique based on domain knowledge;
Further, the process of the text replacement algorithm is:
(1) Given a topic word C, choose the corresponding reference data set F by the topic word C;
(2) Within the reference data set F, find all the words related to the topic word C whose topic relevance exceeds the threshold, forming a set P; the set P is determined through the WordNet dictionary: candidates are extracted by choosing the hypernym/hyponym, member, and entailment relations of the topic word, and the final candidate set P is obtained by comparing the similarity sim between the topic word C and each word p in the candidates against the threshold k;
(3) Within the generated initial comment set IR, find the nouns similar to C as in the second step, forming a set Q; for a word in Q, score it by relevance and randomly replace it with a word from P;
The process of the text paraphrasing algorithm:
For any sentence s ∈ IR, part-of-speech judgment is performed on each word token in it to identify adjective, adverb, and verb tokens. If a word belongs to the synonym dictionary Syn, the token undergoes a translate-and-back-translate process; the resulting paraphrased token is compared by cosine similarity, and the word closest in distance to the original token is used as the replacement, yielding the paraphrased text FRpa produced by the text paraphrasing algorithm;
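The cosine-similarity comparison used to pick the back-translated word closest to the original token can be sketched as follows. The 3-dimensional vectors and the word list are hypothetical; a real system would use the trained word embeddings described earlier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings for an original token and two
# back-translated candidates; the closest candidate is kept.
VEC = {
    "happy":   [0.9, 0.1, 0.0],
    "glad":    [0.85, 0.15, 0.05],
    "content": [0.5, 0.5, 0.2],
}

def closest(original, candidates):
    """Return the candidate whose vector is nearest the original's."""
    return max(candidates, key=lambda c: cosine(VEC[original], VEC[c]))

print(closest("happy", ["glad", "content"]))  # -> "glad"
```

Selecting the nearest neighbor in embedding space is what keeps the paraphrase close in meaning to the token it replaces.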
The process of the model customization algorithm:
The type() function is used to extract the corresponding type of each text in the FR comment set; for the interrogative and exclamatory types in the comment set, the corresponding template type set T in the template library TR is extracted. By judging the template type t' of a sentence sent, the slots of template t' and the corresponding part-of-speech sequence are obtained and matched against template t; the translate function is used to exchange template slot positions, realizing the generation of the template-customized text sent'.
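The template-based customization step can be illustrated with a minimal slot-filling sketch. The template patterns, slot names, and the fill helper below are all hypothetical, standing in for the patent's template library TR and its type()/translate functions.

```python
import re

# Hypothetical template library keyed by sentence type; slots are
# marked with the expected part of speech, e.g. {NOUN}.
TEMPLATES = {
    "interrogative": "What do you think about {NOUN}?",
    "exclamatory": "How {ADJ} this {NOUN} is!",
}

def fill(sent_type, slots):
    """Instantiate a template of the given type with slot values."""
    template = TEMPLATES[sent_type]
    for pos, word in slots.items():
        template = template.replace("{" + pos + "}", word)
    if re.search(r"\{[A-Z]+\}", template):
        raise ValueError("unfilled slot remains")
    return template

print(fill("exclamatory", {"ADJ": "moving", "NOUN": "speech"}))
# -> "How moving this speech is!"
```

Raising on an unfilled slot mirrors the need to match a sentence's part-of-speech sequence against the template's slot sequence before a customized sentence can be emitted.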
The invention has the following beneficial effects:
Research on existing text generation is very mature, but research targeting the Twitter platform is still scarce. For the problems raised in the previous section, the present invention, oriented to the diversity, randomness, and casualness of Twitter comment texts, divides comment texts into different categories by language pattern and generates text of the corresponding category in a targeted way. Using the NLG technique based on LSTM learning, the appearance of sentence structure, the semantics and types of characters, and each individual character are encoded through the probabilistic relations between characters learned from data. The comment information to be expressed is fused at both the semantic and syntactic levels, and in a later stage, through methods such as specific-word replacement, vivid, fluent, and varied high-quality comment text nearly indistinguishable from real social network comments is generated.
The present invention mainly studies comment generation for specific domains of social networks and provides a favorable material corpus for public-opinion guidance. By spreading more truthful and trustworthy speech, it provides netizens with a reliable, positive, healthy, and upward mainstream public-opinion environment and restores a positive network environment. The invention can serve as a material corpus fed into existing public-opinion guidance systems, breaking down massive adverse speech in time. It is of great significance for purifying the Internet environment, safeguarding the public-opinion atmosphere of the country and the people, weakening hostile forces, building a harmonious society, and maintaining national security and comprehensive stable development.
Brief description of the drawings
Fig. 1 is the block diagram of the LSTM-based comment text generation model; Fig. 2 is a schematic diagram of a deviation correction example; Fig. 3 is the comparison chart of the experiment with and without classification; Fig. 4 compares the text F1 values of IR and FR;
Fig. 5 is the comparison chart of model training results on Twitter data versus Yelp data;
Fig. 6 is the variation chart of the repetition rate under each sample value.
Specific embodiments
The realization of the overall scheme of the present invention is illustrated with reference to the accompanying drawings as follows:
1. Since Twitter comment texts have diversity, randomness, and casualness, and given the characteristics of comments in the five fields of politics, health, education, entertainment, and science and technology, comment texts are classified by combining the structural characteristics of the language. For the different categories, different LSTM models are designed; through the learned probability structure, the appearance of word structure, the semantics and types of words, and each individual word are encoded. The comment information to be expressed is fused at both the semantic and syntactic levels, generating initial comments of the different categories. According to the characteristics of each category, corresponding text-processing strategies are formulated to correct the model, generating high-quality comment text nearly indistinguishable from real social network comments. Given a specific domain D under a social network W, the set of hot topics contained in the domain is T = {T_1, T_2, ..., T_n}. Select a topic T_i, and for topic T_i select a specific main post P. Crawl the comment text set under the main post P, denoted RR = {RR_1, RR_2, ..., RR_n}. Through classification, we filter out data of the different categories, denoted FlR = {FlR_1, FlR_2, ..., FlR_n}, and feed them separately into LSTM models with different parameters, generating the initial comment sets of the respective categories, denoted IR = {IR_1, IR_2, ..., IR_n}. For the features of each category, different strategies are formulated to correct the deviation of IR_i, and the final comment set is generated, denoted FR = {FR_1, FR_2, ..., FR_n}, as shown in formulas (1) to (3), where i ∈ {1, 2, 3, ..., n}; the function C represents the text classification process based on the random forest model, the function h represents the text generation process based on LSTM, and the function z_j represents the deviation correction process composed of text replacement, text paraphrasing, and model customization.
C(RR) → FlR_i (1)
h(W, D, T, P, FlR_i) → IR_i (2)
z_j(IR_i) → FR_i, j ∈ {1, 2, 3} (3)
2. Sentence-structure classification based on the random forest model
Users participate ever more in suddenly erupting social events, producing large-scale comment texts. The Twitter commentary style is complex and changeable with low recognizability; directly feeding a large amount of text into a machine learning model for text generation makes the style hard to learn, and the expected good results cannot be obtained. Therefore this patent first classifies comment texts from the angle of sentence structure to obtain single-style comments with highly recognizable speech. A data set is first created by a crawler and processed with word segmentation, part-of-speech tagging, and sentence segmentation. Since too many text features would cause a feature disaster and problems such as overfitting, feature extraction and selection are performed on the text to obtain the feature vector representing the text, which is input into the random forest model; the text classification output is obtained by random forest training, and the comment texts are finally divided into the following seven classes: subject-link-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-predicate-object structure, subject-predicate-object-complement structure, and imperative structure.
Feature extraction is the most critical step of text classification, and the usefulness of the extracted features directly affects the quality of the classification result. In the classification process, if the feature dimension is too high, phenomena such as dimension disaster, overfitting, and excessive noise data may occur. Therefore feature selection and feature extraction are needed for dimensionality reduction. The selected features are denoted by identifiers and are introduced one by one below.
(1) Word embedding + tf-idf feature
Each word is vectorized by selecting the word embedding vector obtained by CBOW model training. The objective function of CBOW is shown in formula (4). For a word w_t, its context is Context(w_t) = {w_{t-b}, ..., w_{t-1}, w_{t+1}, ..., w_{t+b}}, where the constant b determines the contextual window size. The model accuracy is positively correlated with the word window size b. In this study, the window size is b = 4 and the dimension of the final vectorized word is set to 120. Meanwhile, the tf-idf feature selection method is introduced to weigh each word embedding vector in the text, yielding the word embedding + tf-idf feature, named Wetfidf.
(2) WfreMatrix (word frequency matrix) feature
The WfreMatrix feature represents the frequency matrix: its rows indicate article numbers, its columns indicate all the words appearing in the articles, and each matrix element is the frequency with which a word occurs. Word frequency statistics are computed through the interface provided by sklearn.
(3) Pos (part of speech) feature
The sentence structure of English follows certain rules and has apparent templates; it is closely related to the part of speech of each word in the text, and the relative order and dependencies among words determine the tendency of the clause. Comments are rather colloquial, with short clauses, mostly simple sentences; statistics show 10 kinds of parts of speech in total. Part-of-speech tagging of the text is performed with the pos_tag_sents function in the NLTK toolkit.
(4) Key feature
The Key feature represents the keyword feature. A keyword is a word that in many cases can represent a category. For example, the comparative structure can be basically judged by the word "than"; the exclamatory structure can be well locked by the combination of How/What with an exclamation mark; and the linking verb in the subject-link-predicative structure is an apparent marker. Key is computed as formula (5).
(5) Index feature
The Index feature represents the word-position-sequence feature. For the imperative structure, most sentences begin with a verb, i.e., the position sequence of the verb is one, which is a good feature. For the subject-predicate-object structure, the relative order of subject, predicate, and object is an important recognition criterion.
(6) Punc feature
The Punc feature represents the punctuation feature. The types of punctuation differ widely among clause patterns: for the interrogative structure, the presence of a question mark is a very strong feature, while an exclamation mark indicates an exclamatory or imperative sentence. For the punctuation feature of a text, the punc feature is extracted by counting the position sequence of a fixed punctuation sequence.
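The Punc feature just described can be illustrated with a small sketch. The patent does not specify the exact encoding, so the layout below (1-based character positions for each mark in a fixed order, -1 when absent) is an assumption for illustration.

```python
# Fixed punctuation sequence whose position sequence forms the feature.
PUNCTUATION = ["?", "!", ",", "."]

def punc_feature(text):
    """Position sequence of each punctuation mark in a fixed order.

    For every mark in PUNCTUATION, record the 1-based character
    positions at which it occurs, or [-1] if it is absent.  This exact
    layout is assumed; the patent only says positions are counted.
    """
    feats = {}
    for mark in PUNCTUATION:
        positions = [i + 1 for i, ch in enumerate(text) if ch == mark]
        feats[mark] = positions or [-1]
    return feats

f = punc_feature("Is it true?")
print(f["?"])  # -> [11]  (the question mark is the 11th character)
```

For an interrogative comment the "?" entry is populated and the "!" entry is -1, which is exactly the kind of discriminative signal the classifier exploits.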
3. LSTM-based automatic comment text generation
An LSTM-based text generation model with an encoder-decoder structure is built; the basic structure is shown in Fig. 1. Given a Twitter comment short text, the input text is first encoded. In the decoding stage, the probability distributions of candidate characters generated by the LSTM one moment at a time are used, and a reasonable sampling technique (e.g., greedy sampling, random sampling, or beam search) selects and determines the next character to appear; the resulting character sequence constitutes the natural language description of the input semantic items. Both the encoder side and the decoder side are composed of a single-layer LSTM. A context vector C is generated in the encoding stage; this vector serves as the input of the decoding stage, and the decoder side outputs the final sequence data, as shown in Fig. 1.
4. Text deviation correction based on domain knowledge
To make the generated comments stick closely to the model theme, i.e., have higher topic relevance, and be true to life, relatively comprehensive prior knowledge, i.e., domain knowledge, is collected for the event itself. Comments unrelated or weakly related to the theme, and comments contradicting the facts, undergo deviation correction processing, including three kinds of processing: text replacement, text paraphrasing, and model customization, collectively called the text deviation correction technique based on domain knowledge. Taking noun text replacement as an example: under the hot topic of a certain country's leadership election, RR_1 ("It is a great book") is clearly a comment unrelated to the theme, so a text replacement operation is performed on it. As shown in Fig. 2, in the initial comment the theme-unrelated words to be replaced are marked green, the candidate replacement words from the original comment set on the Twitter platform are marked red, and the words after replacement are marked yellow.
5. Algorithm description
To give the finally generated comments higher topical relevance to the target hot-topic post, a noun-based text replacement method is proposed. The candidate set P is determined via WordNet: a candidate set P' is extracted by following the topic word's subordinate, member, and entailment relations, and the final candidate set P is obtained by comparing the similarity sim between the topic word C and each word p in P' against a threshold k. Then, inside the generated initial comment set R, nouns similar to C are found as in the second step, forming a set Q. Each word in Q is scored by its degree of relevance and randomly replaced with a word from P.
The noun-based text replacement algorithm is expressed as follows:
Algorithm 1: Text Replacement Method
Input: initial comment set IR, classified comment set FlRi, topic word C, similarity threshold MINsim
Output: final comment set FRnoun
Step 1: find the word set close to C in the classified comment set, forming set P
For t ∈ FlR
For n ∈ Nouns(t)
a) initialize the set
b) find all words related to the topic word C inside the reference data set F
c) filter out the words whose topic relevance exceeds the threshold, forming a set P
END For
END For
Step 2: find the word set close to C in the initial comment set, forming set Q
For n ∈ Nouns(R)
find the nouns similar to C inside the generated initial comment set R, forming a set Q
For p ∈ P do
score each word in Q by its degree of relevance and randomly replace it with a word from P
END For
END For
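The candidate-set construction in Step 1 can be sketched in a few lines. This is a self-contained illustration, not the patent's implementation: a toy relation table and a character-bigram Jaccard similarity stand in for WordNet and the sim() function, and all data are hypothetical:

```python
# Toy stand-in for the WordNet relations of a topic word (hypothetical data).
RELATIONS = {
    "election": {
        "hyponyms": ["general_election", "primary"],   # subordinate relation
        "member_holonyms": ["politics"],               # member relation
        "entailments": ["vote"],                       # implication relation
    }
}

def sim(a, b):
    """Character-bigram Jaccard similarity, a cheap stand-in for sim()."""
    bigrams = lambda w: {w[i:i + 2] for i in range(len(w) - 1)}
    sa, sb = bigrams(a), bigrams(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def candidate_set(topic_word, min_sim):
    """Build candidate set P' from the relations, then keep only words
    whose similarity to the topic word reaches the threshold (set P)."""
    rel = RELATIONS.get(topic_word, {})
    p_prime = {w for words in rel.values() for w in words}
    return {w for w in p_prime if sim(topic_word, w) >= min_sim}

print(candidate_set("election", 0.3))  # {'general_election'}
```

With a real WordNet backend (e.g., NLTK's wordnet corpus) the relation lookups would replace the RELATIONS table, while the thresholding step stays the same.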
Paraphrasing means expressing the same meaning in multiple ways. In text generation research, paraphrasing can be applied to the automatic rewriting of sentences generated by the LSTM model, helping to produce smoother and more vivid text. In the "word choice" step in particular, a given meaning can be expressed with flexibly chosen vocabulary depending on the context, enriching the finally generated corpus. The text paraphrasing algorithm is expressed as follows:
Algorithm 2: Text Paraphrases Method
Input: initial comment set IR, adjectives ADJ, adverbs ADV
Output: paraphrased text FRpa
For each token e in sentence s
(1) judge the part of speech of each word token, keeping the adjective, adverb, and verb tokens found
(2) if the word belongs to the synonym dictionary Syn, apply the translate-and-retranslate process to the token
(3) obtain the token's paraphrased word
(4) compute cosine similarity and replace the original token with the nearest word
END For
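The word-choice replacement in steps (2) to (4) can be sketched as follows. The synonym dictionary and the 2-d word vectors below are hypothetical stand-ins for the real Syn dictionary and trained embeddings:

```python
import math

# Hypothetical synonym dictionary and word vectors, for illustration only.
SYNONYMS = {"good": ["great", "fine"], "quickly": ["rapidly", "fast"]}
VECTORS = {
    "good": [1.0, 0.2], "great": [0.9, 0.3], "fine": [0.4, 0.9],
    "quickly": [0.1, 1.0], "rapidly": [0.2, 0.9], "fast": [1.0, 0.1],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv) if nu and nv else 0.0

def paraphrase(tokens):
    """Replace each token found in the synonym dictionary with its
    cosine-nearest synonym, mimicking the word-choice replacement step."""
    out = []
    for tok in tokens:
        if tok in SYNONYMS and tok in VECTORS:
            best = max(SYNONYMS[tok], key=lambda s: cosine(VECTORS[tok], VECTORS[s]))
            out.append(best)
        else:
            out.append(tok)
    return out

print(paraphrase(["a", "good", "meal"]))  # ['a', 'great', 'meal']
```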
A "template" is an abstract expression generalized from natural-language phrases and sentences. Because a template is more representative than its corresponding instances, templates are widely used in natural language generation research. A template consists of two parts: pattern words and pattern slots; the pattern words can be regarded as the constant part of the template, while the pattern slots are its variable part. Fixed templates are statistically induced from a large collected corpus of syntactic patterns, and the matching degree between the input item and the templates determines which instances are generated. The same template can be instantiated into many examples, enriching the corpus. The template-based customization algorithm is expressed as follows:
Algorithm 3: Template-based Text Customization Method
Input: classified comment set FR, template library TR
Output: comment set FRim with customized text added
For sent in FR
(1) use the type() function to extract the type of each text in the comment set FR
(2) extract the corresponding template type set T from the template library TR
(3) judge the template type t' of sentence sent
(4) obtain the template slots of t' and the corresponding part-of-speech sequence
(5) match against template t and use the translate function to fill the template slot positions
END For
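Filling a template's pattern slots with part-of-speech-matched words can be sketched like this; the template strings and slot names are illustrative, not taken from the patent's template library TR:

```python
import re

# Hypothetical template library: fixed text is the pattern words,
# and {POS} placeholders are the pattern slots.
TEMPLATES = {
    "exclamatory": "What a {ADJ} {NOUN} this is!",
    "interrogative": "Is this really a {ADJ} {NOUN}?",
}

def instantiate(template_type, slot_values):
    """Fill each pattern slot with a word of the matching part of speech."""
    template = TEMPLATES[template_type]
    return re.sub(r"\{(\w+)\}", lambda m: slot_values[m.group(1)], template)

print(instantiate("exclamatory", {"ADJ": "close", "NOUN": "election"}))
# What a close election this is!
```

One template instantiated with different slot values yields many distinct comments, which is the corpus-enriching property described above.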
The technical effect of the invention is verified as follows:
The model is evaluated for quality using precision and recall. In this setting, precision refers to the ratio of detected machine-generated comments to all detected comments, measuring the exactness of the experimental result. Recall is the ratio of detected machine-generated comments to all machine-generated comments, measuring the completeness of the experimental result. Precision assesses how many of the comments flagged by the detection algorithm are indeed machine-generated fakes; recall assesses how many of all machine-generated fake comments were retrieved. The F1 value is the harmonic mean of precision and recall; lower precision and recall mean that the machine-generated comments are harder to detect. Therefore, to assess the realism of the generated fake comments, lower precision, recall, and F1 values are better.
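Under these definitions the three scores can be computed directly; a minimal sketch (the comment-id sets below are hypothetical):

```python
def detection_scores(detected, machine_generated):
    """Precision, recall, and F1 of a fake-comment detector.
    detected: set of comment ids the detector flagged;
    machine_generated: set of ids that really are machine-generated."""
    tp = len(detected & machine_generated)  # true positives
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(machine_generated) if machine_generated else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

p, r, f1 = detection_scores({1, 2, 3, 4}, {2, 3, 5, 6})
print(p, r, f1)  # 0.5 0.5 0.5
```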
Unlike previous text generation research, the present invention runs a comparison experiment on English sentence structure before text generation, in order to prove the validity of adding the classification step. With all other variables held constant, two modes are set up: in model one, no text classification is performed before the comment generation experiment; in model two, text classification is performed first. Figure 3 shows the F1 values under both modes. The results show that the F1 value of the model with the text classification step is significantly lower than that of the model without prior classification, demonstrating that the classification step is indispensable. In each domain, the IR without text replacement is also compared with the FR after replacement; as shown in Fig. 4, the F1 value drops sharply after text processing, showing that text processing greatly improves the quality of the generated text.
The experiments use a public data set (the restaurant comment set from the Yelp website) together with the data of this research (comments from the Twitter platform) for comparison. Figure 5 shows the detection precision when the model learns the comments of the Yelp platform and of the Twitter platform separately. The results show that the two plots are nearly indistinguishable and that the precision is low in both cases, further indicating that the model has strong cross-platform adaptability.
Many comments on websites are simply copied hundreds or thousands of times to steer public opinion, or a new comment is formed by modifying a partially copied one. Such comments are recognizable at a glance as coordinated, deliberately orchestrated activity meant to drive public opinion, and massively duplicated comments are easily classified as untrustworthy. Therefore, using the K-gram-based Winnowing duplicate detection technique, the FR of this patent is compared with a portion of the real Twitter comments in the database, running duplicate detection between the two sets and the real comments in the database; the repetition rates obtained under different sample rates are shown in Fig. 6. Real comments, involving no copying, remain stable at about 0.08 and do not fluctuate as the sample grows or shrinks. The FR of this research rises with increasing sample rate, its repetition rate declines at a sample rate of 0.5, and it falls below 0.08 at a sample rate of 0.8; thus the repetition rate of the FR of this research is lower than that of real comments.
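K-gram Winnowing fingerprints a text by hashing all of its k-grams and keeping the minimum hash in each sliding window; the overlap between two fingerprint sets then approximates the repetition rate. A minimal sketch (the parameters k and window are illustrative, not the patent's settings):

```python
import zlib

def winnow_fingerprints(text, k=5, window=4):
    """K-gram winnowing: hash every k-gram, then keep the minimum
    hash within each sliding window as the document fingerprint."""
    grams = [text[i:i + k] for i in range(len(text) - k + 1)]
    hashes = [zlib.crc32(g.encode()) for g in grams]
    if not hashes:
        return set()
    fingerprints = set()
    for i in range(max(len(hashes) - window + 1, 1)):
        fingerprints.add(min(hashes[i:i + window]))
    return fingerprints

def overlap_ratio(a, b):
    """Share of a's fingerprints also present in b (a crude repetition rate)."""
    fa, fb = winnow_fingerprints(a), winnow_fingerprints(b)
    return len(fa & fb) / len(fa) if fa else 0.0

print(overlap_ratio("the food was great", "the food was great"))  # 1.0
```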

Claims (5)

1. A social network comment generation method based on LSTM, characterized in that the method comprises:
classifying comment texts into seven categories: subject-linking-verb-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-verb-object structure, subject-verb-object-complement structure, and imperative structure; designing a different LSTM model for each category, obtaining a probability structure through the training of each LSTM model, and generating initial comments IRi of different categories, where the subscript i denotes the category, i = 1, 2, 3, ..., 7;
formulating, according to the characteristics of each category, a corresponding text-processing strategy to correct the corresponding LSTM model, thereby generating high-quality comment texts FRi consistent with real social network comments;
given a specific domain D under a social network W, whose hot topic set is T = {T1, T2, ..., Tn}, selecting a certain topic Ti; for topic Ti, selecting a specific main post P and crawling the comment text set under the main post P, denoted RR = {RR1, RR2, ..., RRn}; through classification, filtering out data of different categories, denoted FlR = {FlR1, FlR2, ..., FlRn}, and feeding them separately into LSTM models with different parameters to generate initial comment sets of the respective categories, denoted IR = {IR1, IR2, ..., IRn}; formulating, for the characteristics of each category, a different drift-correction strategy for each IRi, and generating the final comment set, denoted FR = {FR1, FR2, ..., FRn}, as shown in formulas (1) to (3), where i ∈ {1, 2, 3, ..., n} and n equals 7; the function C represents the text classification process based on the random forest model, the function h represents the LSTM-based text generation process, and the function zj represents the drift-correction process; drift correction comprises three strategies: text replacement, text paraphrasing, and template-based customization;
C(RR)→FlRi (1)
h(W,D,T,P,FlRi)→IRi (2)
zj(IRi)→FRi j∈{1,2,3} (3)。
2. The LSTM-based social network comment generation method according to claim 1, characterized in that
the process of classifying comment texts based on the random forest model is:
first, a data set is created by a crawler and subjected to word segmentation, part-of-speech tagging, and sentence splitting;
next, feature extraction is performed on the texts to obtain feature vectors representing them, which are input into the random forest model; the text classification output is obtained through random forest training, and the comment texts are finally divided into the following seven classes: subject-linking-verb-predicative structure, comparative structure, interrogative structure, exclamatory structure, subject-verb-object structure, subject-verb-object-complement structure, and imperative structure;
in the above feature extraction, feature selection and feature extraction are used for feature dimensionality reduction, with the selected features denoted by identifiers; the specific features extracted are as follows:
(1) Word embedding+tf-idf feature
each word is vectorized using the word embedding obtained by CBOW model training; the objective function of the CBOW model is shown in formula (4); for a word wt, the word context is Context(wt) = {wt-b, ..., wt-1, wt+1, ..., wt+b}, where the constant b determines the context window size, here b = 4; the dimension of the final vectorized word is set to 120; meanwhile, the tf-idf feature selection method is introduced to weight each word embedding vector, yielding the word embedding + tf-idf feature, named Wetfidf;
(2) WfreMatrix (word frequency matrix) feature
the WfreMatrix feature represents a word frequency matrix whose rows correspond to the articles and whose columns correspond to all words appearing in the comment texts; each matrix element is a word's frequency of occurrence; the word frequency statistics are computed through the interface provided by sklearn;
(3) Pos (Part of speech) feature
part-of-speech tagging is performed on the comment texts using the pos_tag_sents function in the NLTK toolkit;
(4) Key feature
the Key feature represents the keyword feature; a keyword is a word that can represent a category; Key is calculated as in formula (5);
(5) Index feature
the Index feature represents the word position sequence feature;
(6) Punc feature
the Punc feature represents the punctuation feature, extracted by counting the position sequences of a fixed punctuation sequence.
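As an illustration of the Wetfidf feature in (1), a tf-idf-weighted combination of word embeddings can be sketched as follows. The two toy documents and the length-based "embeddings" are hypothetical stand-ins for real comments and the 120-dimensional CBOW vectors:

```python
import math
from collections import Counter

docs = [["the", "election", "was", "close"],
        ["great", "food", "at", "the", "restaurant"]]

# Toy embeddings: one short vector per word (stand-in for CBOW output).
EMB = {w: [float(len(w)), 1.0] for d in docs for w in d}

def tfidf(word, doc, all_docs):
    """Term frequency times inverse document frequency."""
    tf = Counter(doc)[word] / len(doc)
    df = sum(1 for d in all_docs if word in d)
    return tf * math.log(len(all_docs) / df)

def doc_vector(doc, all_docs):
    """Wetfidf-style feature: tf-idf-weighted average of word embeddings."""
    dim = len(next(iter(EMB.values())))
    vec = [0.0] * dim
    for w in doc:
        weight = tfidf(w, doc, all_docs)
        for j in range(dim):
            vec[j] += weight * EMB[w][j]
    return [v / len(doc) for v in vec]

print(doc_vector(docs[0], docs))
```

Words that appear in every document (here "the") get idf 0 and thus contribute nothing, which is exactly the weighting effect the tf-idf step adds on top of the raw embeddings.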
3. The LSTM-based social network comment generation method according to claim 2, characterized in that different LSTM models are designed for the different categories, a probability structure is obtained through the training of each LSTM model, and initial comments IRi of different categories are generated, by the following process:
an LSTM-based text generation model with an encoder-decoder structure is built; given a short Twitter comment text, the input text is first encoded; in the decoding stage, the LSTM produces, step by step, a probability distribution over candidate characters at each time step, and a suitable sampling technique selects and determines the next character; the resulting character sequence constitutes a natural-language description of the input semantics; the encoder and the decoder each consist of a single-layer LSTM; the encoding stage produces a context vector C, which serves as the input to the decoding stage, and the decoder outputs the final sequence data.
4. The LSTM-based social network comment generation method according to claim 3, characterized in that, according to the characteristics of each category, corresponding text-processing strategies are formulated to correct the corresponding LSTM model, thereby generating high-quality comment texts FRi consistent with real social network comments, by the following process:
for a topic Ti itself, relatively comprehensive prior knowledge, i.e., domain knowledge, is collected; comments that are unrelated or weakly related to the topic, and comments that contradict the facts, undergo drift-correction processing, which comprises text replacement, text paraphrasing, and template-based customization, collectively referred to as domain-knowledge-based text drift correction.
5. The LSTM-based social network comment generation method according to claim 4, characterized in that
the text replacement algorithm proceeds as follows:
(1) a topic word C is given, and a corresponding reference data set F is chosen according to C;
(2) all words in the reference data set F that are related to the topic word C and whose topic relevance exceeds a threshold are found, forming a set P; the set P is determined via the WordNet dictionary: a candidate set is extracted by following the topic word's subordinate, member, and entailment relations, and the final candidate set P is obtained by comparing the similarity sim between the topic word C and each candidate word p against the threshold k;
(3) inside the generated initial comment set IR, nouns similar to C are found as in the second step, forming a set Q; each word in Q is scored by its degree of relevance and randomly replaced with a word from P;
the text paraphrasing algorithm proceeds as follows:
for any sentence s ∈ IR, the part of speech of each word token is judged; for the adjective, adverb, and verb tokens found, if a word belongs to the synonym dictionary Syn, the token undergoes a translate-and-retranslate process to obtain its paraphrased word; cosine similarity is then computed and the original token is replaced with the nearest word, yielding the paraphrased text FRpa of the text paraphrasing algorithm;
the template-based customization algorithm proceeds as follows:
the type() function extracts the type of each text in the comment set FR; for the interrogative and exclamatory types in the comment set, the corresponding template type set T is extracted from the template library TR; by judging the template type t' of a sentence sent, the template slots of t' and the corresponding part-of-speech sequence are obtained and matched against template t, and the translate function is used to fill the template slot positions, realizing the generation of the template-based customized text sent'.
CN201910680645.7A 2019-07-25 2019-07-25 A kind of social networks comment generation method based on LSTM Pending CN110390018A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910680645.7A CN110390018A (en) 2019-07-25 2019-07-25 A kind of social networks comment generation method based on LSTM


Publications (1)

Publication Number Publication Date
CN110390018A true CN110390018A (en) 2019-10-29

Family

ID=68287434

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910680645.7A Pending CN110390018A (en) 2019-07-25 2019-07-25 A kind of social networks comment generation method based on LSTM

Country Status (1)

Country Link
CN (1) CN110390018A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160180838A1 (en) * 2014-12-22 2016-06-23 Google Inc. User specified keyword spotting using long short term memory neural network feature extractor
CN107491531A (en) * 2017-08-18 2017-12-19 华南师范大学 Chinese network comment sensibility classification method based on integrated study framework


Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
YU TAI et al.: "Automatic Generation of Review Content in Specific domain of social network based on RNN", IEEE *
Zhang Wenyu, Li Dong: "Intelligent Technology of the Internet of Things", 31 December 2012 *
Lan Xiang: "Research on Paraphrase Generation Technology Using Statistical Machine Translation Models", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (22)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111078888A (en) * 2019-12-20 2020-04-28 电子科技大学 Method for automatically classifying comment data of social network users
CN111078888B (en) * 2019-12-20 2021-12-10 电子科技大学 Method for automatically classifying comment data of social network users
CN111126063A (en) * 2019-12-26 2020-05-08 北京百度网讯科技有限公司 Text quality evaluation method and device
CN111126063B (en) * 2019-12-26 2023-06-20 北京百度网讯科技有限公司 Text quality assessment method and device
CN111221940A (en) * 2020-01-03 2020-06-02 京东数字科技控股有限公司 Text generation method and device, electronic equipment and storage medium
CN111541910A (en) * 2020-04-21 2020-08-14 华中科技大学 Video barrage comment automatic generation method and system based on deep learning
CN111541910B (en) * 2020-04-21 2021-04-20 华中科技大学 Video barrage comment automatic generation method and system based on deep learning
CN113705227B (en) * 2020-05-21 2023-04-25 中国科学院上海高等研究院 Method, system, medium and equipment for constructing Chinese word-segmentation-free word embedding model
CN113705227A (en) * 2020-05-21 2021-11-26 中国科学院上海高等研究院 Method, system, medium and device for constructing Chinese non-segmented word and word embedding model
CN114429403A (en) * 2020-10-14 2022-05-03 国际商业机器公司 Mediating between social network and payment curation content producers in false positive content mitigation
CN113033179A (en) * 2021-03-24 2021-06-25 北京百度网讯科技有限公司 Knowledge acquisition method and device, electronic equipment and readable storage medium
CN113033179B (en) * 2021-03-24 2024-05-24 北京百度网讯科技有限公司 Knowledge acquisition method, knowledge acquisition device, electronic equipment and readable storage medium
CN113743086A (en) * 2021-08-31 2021-12-03 北京阅神智能科技有限公司 Chinese sentence evaluation output method
CN114443809A (en) * 2021-12-20 2022-05-06 西安理工大学 Hierarchical text classification method based on LSTM and social network
CN114443809B (en) * 2021-12-20 2024-04-09 西安理工大学 Hierarchical text classification method based on LSTM and social network
CN114510924A (en) * 2022-02-14 2022-05-17 哈尔滨工业大学 Text generation method based on pre-training language model
CN114510649A (en) * 2022-02-25 2022-05-17 西安理工大学 Social network and LSTM model accuracy rate calculation method based on de-duplication sample
CN114510649B (en) * 2022-02-25 2024-04-09 西安理工大学 Social network and LSTM model accuracy calculating method based on deduplication sample
CN114707489A (en) * 2022-03-29 2022-07-05 马上消费金融股份有限公司 Method and device for acquiring marked data set, electronic equipment and storage medium
CN114707489B (en) * 2022-03-29 2023-08-18 马上消费金融股份有限公司 Method and device for acquiring annotation data set, electronic equipment and storage medium
CN117807963A (en) * 2024-03-01 2024-04-02 之江实验室 Text generation method and device in appointed field
CN117807963B (en) * 2024-03-01 2024-04-30 之江实验室 Text generation method and device in appointed field

Similar Documents

Publication Publication Date Title
CN110390018A (en) A kind of social networks comment generation method based on LSTM
Gokulakrishnan et al. Opinion mining and sentiment analysis on a twitter data stream
Cappallo et al. New modality: Emoji challenges in prediction, anticipation, and retrieval
CN109829166B (en) People and host customer opinion mining method based on character-level convolutional neural network
Aragón et al. Overview of MEX-A3T at IberLEF 2020: Fake News and Aggressiveness Analysis in Mexican Spanish.
Barsever et al. Building a better lie detector with BERT: The difference between truth and lies
CN112905739B (en) False comment detection model training method, detection method and electronic equipment
CN114936266A (en) Multi-modal fusion rumor early detection method and system based on gating mechanism
CN105869058B (en) A kind of method that multilayer latent variable model user portrait extracts
CN112966117A (en) Entity linking method
Maynard et al. Multimodal sentiment analysis of social media
Yu et al. BCMF: A bidirectional cross-modal fusion model for fake news detection
Pham Transferring, transforming, ensembling: the novel formula of identifying fake news
CN116737922A (en) Tourist online comment fine granularity emotion analysis method and system
Scola et al. Sarcasm detection with BERT
Bölücü et al. Hate Speech and Offensive Content Identification with Graph Convolutional Networks.
CN113220964B (en) Viewpoint mining method based on short text in network message field
CN113704393A (en) Keyword extraction method, device, equipment and medium
CN114372454A (en) Text information extraction method, model training method, device and storage medium
Kavatagi et al. A context aware embedding for the detection of hate speech in social media networks
Hamed et al. DISINFORMATION DETECTION ABOUT ISLAMIC ISSUES ON SOCIAL MEDIA USING DEEP LEARNING TECHNIQUES
Wang et al. Using ALBERT and Multi-modal Circulant Fusion for Fake News Detection
Li et al. Multilingual toxic text classification model based on deep learning
Lan et al. Mining semantic variation in time series for rumor detection via recurrent neural networks
Upadhyaya et al. Food Items Prediction Using Sentimental Analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination