CN109948031A

CN109948031A - Automatic generation system for online comment sentences with sentiment orientation

Info

Publication number: CN109948031A
Application number: CN201910191363.0A
Authority: CN
Inventors: 夏正友; 刘庆庆; 刘赛赛
Original assignee: Nanjing University of Aeronautics and Astronautics
Current assignee: Nanjing University of Aeronautics and Astronautics
Priority date: 2019-03-12
Filing date: 2019-03-12
Publication date: 2019-06-28

Abstract

The invention describes a system for automatically generating online comment sentences with sentiment orientation. The system can automatically generate matching online comment sentences according to information provided by the user, such as keywords and sentiment. Sentences generated by traditional natural language generation methods are stiff and monotonous, and such methods scale poorly and struggle to adapt to the ever-changing style of human language. The sentence generation mechanism introduced here can generate distinctive sentences that carry a sentiment orientation, dispenses with the knowledge of semantics, grammar and the like that rule-based sentence generation requires, and is simple and efficient. The overall concept of the invention is as follows: sentence resources are first collected from the network as a corpus and classified by sentiment orientation using sentiment analysis techniques; a search framework is then built; and, based on the relevant information provided by the user, text that meets the user's needs is matched from the large volume of data and presented. The system scales well, and the generated sentences are closer to people's everyday language.

Description

Automatic generation system for online comment sentences with sentiment orientation

Technical field

The invention belongs to the field of computer applications, and in particular relates to a method for automatically generating online comments with sentiment orientation.

Background art

In recent years, with the rapid development of technologies such as computers and the Internet, people spend a large amount of their working and personal time online, and much news is also learned from the network. Compared with exchanging opinions and ideas with friends in real life, people are therefore more inclined to publish their views on the network so that those views have more influence.

Natural language generation (NLG) is an interdisciplinary field of artificial intelligence and computational linguistics whose purpose is to make machines generate natural language that humans can understand. NLG technology is applied in many fields, such as dialogue systems and machine translation, and its development can promote progress in many fields. Scholars have proposed many NLG methods, of which the most robust and most popular is the rule-based/template-based method. Rhetorical Structure Theory (RST), proposed by Mann et al., was extended into a theoretical basis for computational text planning and is the forerunner of rule-based generation. RST became the basis of text generation methods proposed by many scholars, especially for planning large texts. Sugiyama et al., observing that the language produced by earlier template-based generators sometimes includes sentences unrelated to the input user utterance, proposed an improved template-based method that fills templates with the most salient words in the user utterance and extracts related words using web-scale dependency structures collected from Twitter.

Trainable sentence generators appeared later. The trainable sentence generator proposed by Stent et al. can automatically adapt general linguistic knowledge to the application domain; it has the advantages of being fast, flexible and general while producing high-quality output in a specific domain, and it can produce output comparable to that of the template-based generator of MATCH. With the development of the network, data has become increasingly easy to acquire, and new corpus-based NLG methods have accordingly been proposed and widely applied. Oh and Rudnicky proposed a corpus-based NLG method that models the language spoken by domain experts performing the task of interest and uses the model to generate system utterances stochastically. This technique was later applied to sentence realization and content planning, and the resulting generation component was integrated into a working natural dialogue system. They built word-based n-gram language models from two corpora and then generated sentences stochastically.

Although the traditional NLG systems above are still widely used, they also have problems: they depend heavily on manual customization, the sentences they generate are monotonous and cannot adapt to the ever-changing style of human language, and their generalization ability is poor, so they cannot be extended to the generation of online comment sentences. For our application, the biggest problem with the above methods is that they ignore the role of the user in the sentence generation system, so the generated sentences cannot be driven by the user. Our system is mainly user-oriented and can generate sentences that meet the user's needs according to the information the user provides.

Summary of the invention

The present invention is a system for automatically generating online comment sentences with sentiment orientation, which can automatically generate matching online comment sentences according to information provided by the user, such as keywords and sentiment.

Sentences generated by traditional natural language generation methods are stiff and monotonous, and such methods scale poorly and struggle to adapt to the ever-changing style of human language. Our goal is to generate fluent text with a personal emotional color for the end user. The sentence generation mechanism introduced here can generate distinctive sentences that carry a sentiment orientation, dispenses with the knowledge of semantics, grammar and the like that rule-based sentence generation requires, and is simple and efficient. Our idea is first to collect sentence resources from the network as a corpus and classify them by sentiment orientation using sentiment analysis techniques, and then, borrowing the idea of a search engine, to find the sentences that meet the user's needs in the large volume of data, given the relevant information provided by the user, and present them. Sentences generated in this way are closer to people's everyday language.

The present invention provides a mechanism for automatically generating online comment sentences with sentiment orientation. The flow of the whole system is shown in Fig. 1 and specifically includes the following steps:

Step 1: Crawl data from the network. Web crawler technology is used; based on our needs, we choose the relatively simple focused web crawler. Popular sites such as Weibo, Zhihu and Tianya are selected as crawl targets, and the crawled content is comment sentences and their corresponding like counts. To maximize the diversity of our sentences, we crawled 100,000 sentences from the network, which are subsequently organized into the corpus; the crawl volume can of course be expanded as needed.

Step 2: Clean and store the data. Only the text portion of the web page content should be extracted for storage. Online comment sentences may contain emoji, pictures, reposts, web links and other irregular or unneeded information, so the content must be processed with regular expressions during crawling to filter out the information we do not need and to replace information whose format cannot be stored directly. For example, emoji cannot be saved to the database directly, but they are important for expressing sentiment and helpful for the subsequent sentiment analysis, so this kind of information cannot simply be filtered out; instead, each emoji is converted into a corresponding textual expression of emotion and saved together with the crawled sentence. The matching rules of the regular expressions are shown in Fig. 2.

Step 3: Perform sentiment analysis on the corpus sentences. Sentiment analysis, also called orientation analysis, is the process of analyzing, processing, summarizing and reasoning about subjective text that carries emotional color. The online comments we crawl are the critical or approving sentiments that a large number of users express about things such as people, products or events. Based on this, in order to generate text whose orientation matches the user's sentiment, sentiment analysis must be performed on the crawled comments so that the final text can be filtered to match the user's orientation. We perform sentiment analysis on the crawled sentences using machine learning techniques: the chi-square test is used for feature extraction and an SVM classifier for sentiment classification, and the sentiment analysis results are written to the database as the analysis proceeds. The sentiment analysis process is shown in the third part of Fig. 1.

Step 4: Build the search framework. It is important to build a search framework that can respond quickly and effectively to the search requests of a large number of users. Lucene is an excellent full-text search engine architecture with low coupling, high efficiency and ease of secondary development. By design it completes the computation-heavy work when the index is built, creating an efficient index library for the documents, so retrieval is efficient and fast. We therefore build our search framework on Lucene. Fig. 3 shows the flow of full-text search with Lucene.

Step 5: Obtain matching sentences based on keywords and sentiment information. In the system's query interface, the user provides the keywords or central idea of the text to be generated and selects the corresponding sentiment orientation; the system returns matching text to the user according to the keywords or other text information provided and the sentiment orientation selected.

Brief description of the drawings

Fig. 1 is a flow chart of the system for generating online comment sentences with sentiment orientation;

Fig. 2 shows the regular expression matching rules;

Fig. 3 is an architecture diagram of full-text indexing.

Detailed description of the embodiments

To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings.

The overall concept of the invention is as follows. A large number of online comments are first crawled from the network and, after cleaning, kept as the corpus. Next, a sentiment analysis algorithm judges the sentiment of each sentence in the corpus, where sentiment is divided into positive and negative. A search framework is then built on the cleaned corpus. Finally, according to the information entered by the user, the online comment sentences that best meet the user's needs are matched from the large volume of data. The specific steps are as follows:

Step 1: Crawl data from the network.

Using web crawler technology, more than 100,000 online comments and their corresponding like counts were crawled from the comment sections of popular sites such as Weibo and Zhihu, and subsequently organized into the corpus. A web crawler is a program that collects web page content autonomously. According to system structure and implementation technique, web crawlers can be roughly divided into general-purpose crawlers, focused crawlers, incremental crawlers and deep-web crawlers; based on our needs, we choose the relatively simple focused crawler. The structure of the focused crawler used is shown in the first part of Fig. 1. We first determine the crawl target and obtain the initial URLs. After a page is analyzed, the links in it are extracted, unneeded links are filtered out according to our target, and the newly obtained URLs are added to the URL queue. A search algorithm then determines the priority of each URL in the queue, and each time the URL with the highest priority is selected for content crawling. This process repeats and stops when no new URL can be obtained.
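The crawl loop described above can be sketched as a priority-ordered URL frontier. The following is a minimal illustration only, not the implementation of the invention: the function names, the toy link graph and the priority rule are all hypothetical, and page fetching is replaced by an in-memory dictionary so that the example runs without network access.

```python
import heapq
from typing import Callable, Iterable, List, Set, Tuple


def focused_crawl(
    seeds: Iterable[str],
    fetch_links: Callable[[str], List[str]],
    is_relevant: Callable[[str], bool],
    priority: Callable[[str], float],
    max_pages: int = 1000,
) -> List[str]:
    """Visit URLs in priority order until no new URL can be obtained."""
    frontier: List[Tuple[float, int, str]] = []  # min-heap of (-priority, order, url)
    seen: Set[str] = set()
    counter = 0
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-priority(url), counter, url))
            counter += 1

    visited: List[str] = []
    while frontier and len(visited) < max_pages:
        _, _, url = heapq.heappop(frontier)  # highest-priority URL first
        visited.append(url)  # a real crawler would extract comments and like counts here
        for link in fetch_links(url):
            # Skip links already queued and links outside the crawl target.
            if link in seen or not is_relevant(link):
                continue
            seen.add(link)
            heapq.heappush(frontier, (-priority(link), counter, link))
            counter += 1
    return visited


# Hypothetical toy "web": each URL maps to the links found on that page.
TOY_WEB = {
    "site/home": ["site/comments/1", "site/ads", "site/comments/2"],
    "site/comments/1": ["site/comments/3", "site/home"],
    "site/comments/2": [],
    "site/comments/3": [],
}

order = focused_crawl(
    seeds=["site/home"],
    fetch_links=lambda url: TOY_WEB.get(url, []),
    is_relevant=lambda url: "ads" not in url,                     # assumed filter rule
    priority=lambda url: 1.0 if "comments" in url else 0.0,       # assumed priority rule
)
```

The insertion counter breaks ties between URLs of equal priority, so the visiting order is deterministic. In practice `fetch_links` would download and parse the page, and the relevance and priority rules would depend on the target sites.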

Step 2: Clean and store the data.

When people express opinions on social platforms such as Weibo and Zhihu (Weibo in particular), they often attach related emoji or pictures to reinforce the emotion in their words. These non-text forms of expression change form when crawled, and the resulting irregular symbols interfere with the normal reading of a sentence and would confuse users when they later select sentences. Therefore, when crawling comment sentences we use regular expressions, filtering out irregular emoji and pictures with Python's re module so that only the textual content of each comment is kept and stored in the database. Fig. 2 illustrates the rules used for regular expression matching and filtering.
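A cleaning pass of this kind might look like the sketch below, which uses Python's re module. The actual matching rules are those of Fig. 2 and are not reproduced in the text, so every pattern here, the emoji-to-text mapping and the function name are assumptions made purely for illustration. The `keep_emotion` switch covers both behaviours mentioned in the description: converting emoji to emotion words, or dropping them.

```python
import re

# Assumed patterns; the real rules are given in Fig. 2 of the source.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@[\w\-]+")
OTHER_EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
SPACE_RE = re.compile(r"\s+")

# Hypothetical mapping from emoji to a textual expression of the emotion.
EMOJI_TO_TEXT = {"😀": "happy", "😡": "angry"}


def clean_comment(raw: str, keep_emotion: bool = True) -> str:
    """Strip links, mentions and symbols; optionally turn known emoji into words."""
    text = URL_RE.sub(" ", raw)
    text = MENTION_RE.sub(" ", text)
    for emoji, word in EMOJI_TO_TEXT.items():
        text = text.replace(emoji, f" {word} " if keep_emotion else " ")
    text = OTHER_EMOJI_RE.sub(" ", text)      # drop emoji that have no mapping
    return SPACE_RE.sub(" ", text).strip()    # collapse leftover whitespace


cleaned = clean_comment("Great phone 😀 http://t.cn/abc @someone")
```

URLs are removed before mentions so that an `@` inside a link is not treated as a user mention.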

Step 3: Perform sentiment analysis on the corpus sentences.

The algorithm flow chart is shown in the third part of Fig. 1. For feature extraction we use the chi-square test. The chi-square test measures the degree of deviation between the observed values of a statistical sample and the theoretically expected values. This deviation determines the size of the chi-square value: the larger the chi-square value, the greater the deviation and the poorer the agreement; the smaller the value, the smaller the deviation and the closer the agreement. If the two values are exactly equal, the chi-square value is 0, indicating that the observations agree completely with the theoretical values. Classification training uses an SVM classifier; SVM is an algorithm for solving binary classification problems. The concrete implementation is as follows:

1. Using the pandas library, read the labeled corpus data from an Excel file (or another document); the data are labeled as pos and neg.

2. Segment the training data into words and store the segmented results in a list; then split the data set into a test set and a training set and save each separately.

3. Use the chi-square test to extract features from the segmented training set, extracting a specified number of features that serve as the dimensions of the generated word vectors (the word vectors are the input of the SVM model).

4. Train on the word vectors with an SVM classifier; we choose the soft-margin SVM, i.e. SVC. The trained SVM model is saved to model.pkl so that it can be loaded directly for the later sentence-level sentiment analysis.

5. Test the trained model on the test set, and compare the predicted labels with the original labels of the test set to compute the evaluation metrics. This step is needed only to evaluate our model and may be omitted.
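The feature-extraction part of the procedure above (step 3) can be illustrated with a small, dependency-free sketch. It scores each term with the standard chi-square statistic for a 2x2 term/class contingency table, keeps the top-scoring terms, and turns a segmented sentence into a binary vector over those terms. The function names and the toy data are hypothetical, and the SVM training and model.pkl persistence of step 4 are not shown; the resulting vectors would simply be the input to whichever SVM implementation is used.

```python
from collections import Counter
from typing import Dict, List, Sequence


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Chi-square for a 2x2 table: a/b = pos/neg docs with the term, c/d = without."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def select_features(docs: Sequence[Sequence[str]], labels: Sequence[str], k: int) -> List[str]:
    """Return the k terms whose presence deviates most from class independence."""
    n_pos = sum(1 for y in labels if y == "pos")
    n_neg = len(labels) - n_pos
    pos_df: Counter = Counter()
    neg_df: Counter = Counter()
    for tokens, y in zip(docs, labels):
        (pos_df if y == "pos" else neg_df).update(set(tokens))  # document frequency
    scores: Dict[str, float] = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


def vectorize(tokens: Sequence[str], features: Sequence[str]) -> List[int]:
    """Binary vector: 1 if the selected feature occurs in the sentence."""
    present = set(tokens)
    return [1 if f in present else 0 for f in features]


# Hypothetical, already segmented toy training data.
DOCS = [["good", "phone"], ["good", "screen"], ["bad", "phone"], ["bad", "battery"]]
LABELS = ["pos", "pos", "neg", "neg"]

features = select_features(DOCS, LABELS, k=2)
vector = vectorize(["good", "battery"], features)
```

In this toy corpus "good" and "bad" each occur in only one class and therefore score highest, while "phone" occurs equally in both classes and scores 0, matching the description that a chi-square value of 0 indicates no deviation from the expected distribution.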

Step 4: Build the search framework.

The search framework is built with Lucene, a full-text search engine architecture with low coupling, high efficiency and ease of secondary development; it uses a highly optimized inverted index structure and achieves very high retrieval efficiency. Fig. 3 illustrates how Lucene implements full-text indexing. The process is divided into two parts, building the index and querying the index, which are described in turn below.

Building the index: 1. Obtain the original documents from the corpus we have organized, including the online comment content, the corresponding like count and the sentiment orientation. 2. Create document objects. A document object is a set of fields that represent metadata related to the document. Here, according to our needs and the data obtained, we create three fields: commentText, commentCounts and sentiment; the three are stored and indexed separately as different fields of the document. 3. Analyze the field content, that is, split the text of the documents we created into a series of individual atomic elements called tokens. Here we need a Chinese word segmenter. We compared the performance of several segmenters and, weighing both segmentation speed and accuracy, chose mmseg4j, which provides the Simple and Complex segmentation methods, both based on the forward maximum matching algorithm. 4. Create the index. Lucene indexes tokens using an inverted index data structure, finding documents by word, which is more concise and faster than traditional indexing methods.
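To make the two ideas in this paragraph concrete, the sketch below shows a plain forward maximum matching segmenter and a toy inverted index built over the commentText field. It is written in Python for illustration and does not use Lucene or mmseg4j; the lexicon, the sample documents and the function names are hypothetical, and real segmenters add disambiguation rules that are omitted here.

```python
from collections import defaultdict
from typing import Dict, List, Set


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy left-to-right segmentation: always take the longest lexicon word."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_inverted_index(documents: List[dict], lexicon: Set[str]) -> Dict[str, Set[int]]:
    """Map each token of commentText to the ids of the documents that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for token in forward_max_match(doc["commentText"], lexicon):
            index[token].add(doc_id)
    return index


# Hypothetical lexicon: phone, screen, very good, battery.
LEXICON = {"手机", "屏幕", "很好", "电池"}

# Hypothetical documents using the three fields named in the description.
DOCUMENTS = [
    {"commentText": "手机屏幕很好", "commentCounts": 120, "sentiment": "pos"},
    {"commentText": "电池不行", "commentCounts": 45, "sentiment": "neg"},
    {"commentText": "屏幕很好", "commentCounts": 8, "sentiment": "pos"},
]

INDEX = build_inverted_index(DOCUMENTS, LEXICON)
tokens = forward_max_match("手机屏幕很好", LEXICON)
```

Looking up a word in `INDEX` returns the matching document ids directly, which is the "find documents by word" behaviour that makes an inverted index faster than scanning every document.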

Querying the index: 1. Query interface. A query interface must be provided for the user so that the system can receive information, match sentences and display the results. 2. Build the query. Before a search is executed on the keywords the user entered, a Query object is created; the Query object can specify the document Field to search, the query keywords and so on, and it generates the specific query syntax. Because QueryParser supports flexible combinations, including Boolean logic expressions and fuzzy matching, we choose QueryParser to parse the query expression. 3. Execute the query. Lucene's near-real-time (NRT) search is used. NRT search can search content that the IndexWriter has not yet committed; it allows content in the system to be indexed and searched quickly and reduces the overhead incurred when the system commits index operations. Lucene implements NRT search through the NRTManager class: calling NRTManager's maybeReopen method directly yields the latest IndexSearcher object and thus the latest index, and the specified number of documents (i.e. sentences) is returned. For the user's convenience, the priority of returned documents is determined by the like count of each comment sentence, and the documents with the most likes are shown to the user first.

Step 5: Obtain matching sentences based on keywords and sentiment information.

At the query interface provided by the search framework, the user enters keywords for the topic or central idea of the sentence to be generated, and we also provide the user with an option for selecting the sentiment orientation. According to the keywords provided and the sentiment orientation selected, the back end matches the online comment sentences that best meet the user's needs and presents them in order of like count; the user can then choose the online comment sentence that best matches their intent.
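The matching step can be illustrated with a simplified, in-memory stand-in for the Lucene query. In this sketch a plain substring test replaces full-text retrieval, all keywords are required to appear (an assumption, since the text does not say how multiple keywords are combined), results are restricted to the selected sentiment, and they are ordered by like count. The corpus contents and the function name are hypothetical.

```python
from typing import Dict, List, Sequence


def match_comments(
    corpus: Sequence[Dict],
    keywords: Sequence[str],
    sentiment: str,
    limit: int = 3,
) -> List[str]:
    """Return up to `limit` comments with the chosen sentiment, most-liked first."""
    hits = [
        doc for doc in corpus
        if doc["sentiment"] == sentiment
        and all(word in doc["commentText"] for word in keywords)  # assumed AND semantics
    ]
    hits.sort(key=lambda doc: doc["commentCounts"], reverse=True)
    return [doc["commentText"] for doc in hits[:limit]]


# Hypothetical corpus using the field names from the description.
CORPUS = [
    {"commentText": "The screen is gorgeous", "commentCounts": 8, "sentiment": "pos"},
    {"commentText": "Love the screen and the battery", "commentCounts": 120, "sentiment": "pos"},
    {"commentText": "The screen cracked in a week", "commentCounts": 300, "sentiment": "neg"},
    {"commentText": "Battery life is great", "commentCounts": 45, "sentiment": "pos"},
]

positive_hits = match_comments(CORPUS, ["screen"], "pos")
negative_hits = match_comments(CORPUS, ["screen"], "neg")
```

The same keyword yields different sentences depending on the sentiment selected, which is the user-driven behaviour the description emphasizes.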

Claims

1. An automatic generation system for online comment sentences with sentiment orientation, characterized in that it comprises online comment sentence crawling, data cleaning and storage, sentiment analysis, search framework construction, and sentence matching and generation:

the online comment sentence crawling is used to build the data reserve for the system, and the crawled online comment sentences serve as the original corpus;

the data cleaning and storage is used to clean the data of the original corpus: invalid information and non-text information are filtered out, useful information is converted into text format, duplicate information is deleted, and the processed data is stored in the database for subsequent use;

the sentiment analysis is used to analyze the sentiment orientation of the sentences in the corpus and write the results to the database;

the search framework construction is used to build the search framework and establish a full-text index over the data in the database;

the sentence matching and generation is used to execute queries and return results: after the full-text index is established, the query interface receives the user's input and selection, matches the corresponding text according to the user's input and the selected sentiment orientation, and feeds it back to the user.

2. The automatic generation system for online comment sentences with sentiment orientation according to claim 1, characterized in that the system is user-oriented and user-driven, and can generate sentences that meet the user's needs according to the information the user provides.

3. The automatic generation system for online comment sentences with sentiment orientation according to claim 1, characterized in that it dispenses with the knowledge of semantics, grammar and the like required by traditional rule-based/template-based sentence generation, its construction principle is easy to understand, and it is simple and efficient to use.