CN109948031A - On-Line review sentence automatic creation system with Sentiment orientation - Google Patents
- Publication number
- CN109948031A (application number CN201910191363.0A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- user
- sentiment orientation
- sentiment
- line review
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention describes a system that automatically generates online comment sentences with a sentiment orientation: given information supplied by the user, such as keywords and a sentiment, it produces matching online comment sentences. Sentences produced by traditional natural language generation methods are stiff and monotonous, and those methods scale poorly and struggle to keep up with ever-changing human language styles. The generation mechanism introduced here produces distinctive sentences that carry a sentiment orientation, and it dispenses with the semantic and grammatical knowledge base that rule-based generation requires, which makes it simple and efficient. The overall concept is to first collect sentences from the web as a corpus and classify their sentiment orientation using sentiment analysis techniques, then build a search framework that, based on the information the user provides, matches text meeting the user's needs from a large volume of data and presents it. The system is easy to extend, and the sentences it returns are closer to people's everyday expressions.
Description
Technical field
The invention belongs to the field of computer applications and in particular relates to a method for automatically generating online comments with a sentiment orientation.
Background technique
In recent years, with the rapid development of computer and internet technology, people spend a large amount of their working and personal time online, and much of their news also comes from the internet. As a result, rather than exchanging opinions with friends face to face, people are more inclined to post their views online, where those views can have greater influence.
Natural language generation (NLG) is an interdisciplinary field spanning artificial intelligence and computational linguistics; its aim is for machines to produce natural language that humans can understand. NLG is applied in many fields, such as dialogue systems and machine translation, and its development can drive progress in all of them. Researchers have proposed many NLG methods over the years, of which the most robust and most widely used is the rule-based/template-based approach. Rhetorical Structure Theory (RST), proposed by Mann et al., was extended into a theoretical foundation for computational text planning and is the forerunner of rule-based generation. RST became the basis of text generation methods proposed by many researchers, especially for planning large texts. Sugiyama et al. observed that earlier template-based generators sometimes produce sentences unrelated to the user's input utterance, and proposed an improved template-based method that fills templates with the most salient words in the user utterance and extracts related words using web-scale dependency structures collected from Twitter. Trainable sentence generators followed: the trainable sentence generator proposed by Stent et al. adapts general linguistic knowledge to the application domain automatically, combining speed, flexibility, and generality with high-quality output in a specific domain, and its output is comparable to that of the template-based generator of MATCH.
As the web developed and data became ever easier to obtain, corpus-based natural language generation methods emerged and were widely adopted. Oh and Rudnicky proposed a corpus-based NLG method that models the language spoken by domain experts performing the task of interest and uses that model to generate system utterances stochastically. They applied the technique to both sentence realization and content planning, and integrated the resulting generation component into a working natural dialogue system. They built word-based n-gram language models from two corpora and then generated sentences stochastically.
Although the traditional NLG systems above are still widely used, they have several problems: they depend heavily on hand customization, the sentences they generate are monotonous and cannot adapt to ever-changing human language styles, and they generalize poorly, so they cannot be extended to generating online comment sentences. For our purposes, their biggest shortcoming is that they ignore the user's role in sentence generation: the generated sentences cannot be driven by the user. Our system is user-oriented and generates sentences that meet the user's needs based on the specific information the user provides.
Summary of the invention
The invention is a system that automatically generates online comment sentences with a sentiment orientation: given information supplied by the user, such as keywords and a sentiment, it automatically produces matching online comment sentences.
Sentences produced by traditional natural language generation methods are stiff and monotonous, and such methods scale poorly and struggle to adapt to ever-changing human language styles. Our goal is to give end users fluent text that carries a personal emotional tone. The generation mechanism introduced here produces distinctive sentences with a sentiment orientation, and it dispenses with the semantic and grammatical knowledge base that rule-based generation requires, which makes it simple and efficient. Our idea is to first collect sentences from the web as a corpus and classify their sentiment orientation using sentiment analysis techniques, and then, borrowing the idea of a search engine, to find and present sentences in that large body of data that meet the user's needs, based on the information the user provides. Sentences obtained this way are closer to people's everyday expressions.
The invention provides a mechanism for automatically generating online comment sentences with a sentiment orientation. The overall system flow is shown in Fig. 1 and comprises the following steps:
Step 1: Crawl data from the web. Web crawler technology is used; given our requirements, we chose a relatively simple focused web crawler. Popular sites such as Weibo, Zhihu, and Tianya are selected as crawl targets, and the content crawled is comment sentences together with their like counts. To maximize the diversity of our sentences, we crawled 100,000 sentences from the web, which were subsequently organized into the corpus; the crawl volume can of course be expanded as needed.
Step 2: Clean and store the data. Only the text portion of the web page content should be extracted for storage. Online comment sentences can contain emoji, pictures, reposts, web links, and other irregular or unneeded information, so the content is processed with regular expressions at crawl time: information we do not need is filtered out, and information whose format cannot be stored directly is replaced. For example, emoji cannot be saved to the database directly, yet they matter a great deal for expressing emotion and help the subsequent sentiment analysis, so they must not simply be filtered out; instead, each emoji is converted into the corresponding emotion word and saved together with the crawled sentence. The matching rules of the regular expressions are shown in Fig. 2.
Step 3: Perform sentiment analysis on the corpus sentences. Sentiment analysis, also called orientation analysis, is the process of analyzing, processing, summarizing, and reasoning about subjective text that carries emotional coloring. The online comments we crawl express the critical or approving feelings of a large number of users toward things such as tasks, products, or events. To generate text with the same sentiment orientation as the user's, we therefore need to run sentiment analysis on the crawled comments so that the final text can be filtered to match the user's orientation. We perform sentiment analysis on the crawled sentences with machine learning techniques: the chi-square test is used for feature extraction and an SVM classifier for sentiment classification, and the sentiment result for each sentence is written to the database as the analysis runs. The sentiment analysis process is shown in the third part of Fig. 1.
Step 4: Build the search framework. It is important to build a search framework that can respond quickly and effectively to search requests from a large number of users. Lucene is an excellent full-text search engine architecture that is loosely coupled, efficient, and easy to extend through secondary development. By design it does the computationally heavy work when the index is built, creating an efficient index of the documents, so retrieval is efficient and fast. We therefore build our search framework on top of Lucene. Fig. 3 shows the process by which Lucene performs full-text search.
Step 5: Obtain matching sentences based on keywords and sentiment information. In the system's query interface, the user provides keywords or the central idea of the text to be generated and selects the desired sentiment orientation; the system then returns matching text to the user based on the keywords or other text information provided and the sentiment orientation selected.
Brief description of the drawings
Fig. 1 is a flow chart of the system for generating online comment sentences with a sentiment orientation;
Fig. 2 shows the regular expression matching rules;
Fig. 3 is the full-text index architecture diagram.
Specific embodiment
To make the objectives, technical solutions, and advantages of the invention clearer, the technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings.
The overall concept of the invention is to first crawl a large number of online comments from the web and organize them into a corpus. Each sentence in the corpus is then judged by a sentiment analysis algorithm, with sentiment divided into positive and negative. Next, a search framework is built on the organized corpus, and finally, based on the information the user enters, the online comment sentences that best meet the user's needs are matched from this large body of data. The specific steps are as follows:
Step 1: Crawl data from the web.
Using web crawler technology, more than 100,000 online comments and their like counts were crawled from the comment sections of popular sites such as Weibo and Zhihu, and subsequently organized into the corpus. A web crawler is a program that collects web page content autonomously. By system structure and implementation technique, crawlers fall roughly into general-purpose web crawlers, focused web crawlers, incremental web crawlers, and deep web crawlers; given our requirements, we chose a relatively simple focused web crawler. Its structure is shown in the first part of Fig. 1. We first determine the crawl target and obtain the initial URLs. After analyzing a page, we extract the links it contains, filter out the links we do not need according to our target, and add the new URLs to the URL queue. A search algorithm then determines the priority of each URL in the queue, and each time a high-priority URL is selected for content crawling. This process repeats and stops when no new URLs can be obtained.
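The crawl loop described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the in-memory `FAKE_WEB` dictionary stands in for real HTTP fetching, and the `score_url` heuristic, the `TOPIC_WORDS` tuple, the `is_wanted` filter, and all URLs are assumptions made up for the example. It shows a priority queue of URLs, filtering of unwanted links, and termination when no new URL remains.

```python
import heapq

# Hypothetical stand-in for the web: URL -> (page text, outgoing links).
FAKE_WEB = {
    "site/hot": ("hot topics", ["site/comment/1", "site/ads/1", "site/comment/2"]),
    "site/comment/1": ("great phone, love it", ["site/comment/2", "site/login"]),
    "site/comment/2": ("battery is terrible", ["site/hot"]),
    "site/ads/1": ("buy now", []),
    "site/login": ("please log in", []),
}

TOPIC_WORDS = ("comment",)  # assumed notion of "relevant to our crawl target"


def score_url(url):
    """Assumed priority heuristic: more topic words in the URL, higher score."""
    return sum(word in url for word in TOPIC_WORDS)


def is_wanted(url):
    """Filter out links that do not serve the crawl target (assumed rule)."""
    return "ads" not in url and "login" not in url


def focused_crawl(seed_urls):
    """Crawl from the seeds, always visiting the highest-priority URL next."""
    # heapq is a min-heap, so the score is negated; the URL string breaks ties.
    queue = [(-score_url(u), u) for u in seed_urls]
    heapq.heapify(queue)
    seen = set(seed_urls)
    pages = []
    while queue:  # stop when no new URL can be obtained
        _, url = heapq.heappop(queue)
        text, links = FAKE_WEB.get(url, ("", []))
        pages.append((url, text))
        for link in links:
            if link not in seen and is_wanted(link):
                seen.add(link)
                heapq.heappush(queue, (-score_url(link), link))
    return pages


if __name__ == "__main__":
    for url, text in focused_crawl(["site/hot"]):
        print(url, "->", text)
```

A real crawler would replace the `FAKE_WEB` lookup with an HTTP request and HTML link extraction, and would typically add politeness delays and error handling, none of which is shown here.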
Step 2: Clean and store the data.
When people express opinions on social platforms such as Weibo and Zhihu (especially Weibo), they often attach related emoji or pictures to reinforce the emotion of what they say. These non-text forms of expression change form when crawled, and the resulting irregular symbols interfere with the normal reading of the sentence and would make it harder for users to choose among sentences later. Therefore, when crawling comment sentences we use regular expressions, via Python's re module, to filter out irregular emoticons and pictures, keep only the textual content of each comment, and store it in the database. Fig. 2 shows the rules used for regular-expression matching and filtering.
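A minimal sketch of this kind of cleaning with Python's `re` module is shown here. The actual matching rules are those of Fig. 2, which is not reproduced in this text, so the patterns below (URLs, `@user` mentions, bracketed emoticon codes, and a small emoji range) and the `EMOTICON_WORDS` mapping are illustrative assumptions only. Known emoticons are turned into emotion words, as the summary of the invention suggests, and everything else that is not plain text is dropped.

```python
import re

# Assumed patterns; the patent's real rules are in its Fig. 2.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")  # assumes a mention ends at whitespace
BRACKET_EMOTICON_RE = re.compile(r"\[([^\[\]]{1,8})\]")  # e.g. [smile], [doge]
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
SPACE_RE = re.compile(r"\s+")

# Hypothetical mapping from emoticon codes / emoji to emotion words.
EMOTICON_WORDS = {"smile": "happy", "rage": "angry", "\U0001F60A": "happy"}


def _to_word(code):
    """Return the emotion word for a known emoticon, or a blank to drop it."""
    word = EMOTICON_WORDS.get(code)
    return f" {word} " if word else " "


def clean_comment(raw):
    """Strip links, mentions and unknown emoticons; keep known ones as words."""
    text = URL_RE.sub(" ", raw)
    text = MENTION_RE.sub(" ", text)
    text = BRACKET_EMOTICON_RE.sub(lambda m: _to_word(m.group(1)), text)
    text = EMOJI_RE.sub(lambda m: _to_word(m.group(0)), text)
    # Collapse the whitespace left behind by the removals.
    return SPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_comment("Great phone[smile] @friend http://t.cn/abc [doge]"))
```

Whether an emoticon is converted or simply removed is a design choice; the mapping table makes it easy to extend the set of emoticons that are preserved for the later sentiment analysis.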
Step 3: Perform sentiment analysis on the corpus sentences.
The algorithm flow is shown in the third part of Fig. 1. For feature extraction we use the chi-square test, which measures the degree of deviation between the actual observed values of a statistical sample and the theoretically inferred (expected) values. That degree of deviation determines the size of the chi-square value: the larger the chi-square value, the greater the deviation and the worse the agreement; the smaller the chi-square value, the smaller the deviation and the better the agreement. If the two values are exactly equal, the chi-square value is 0, meaning the observations agree completely with the theoretical values. For classifier training we use SVM classification, an algorithm for solving binary classification problems. The concrete implementation is as follows:
1. Use the pandas library to read the labeled corpus data from an Excel file (or another document); the data are labeled as pos or neg.
2. Segment the training data into words and store the segmented result in a list, then split the data set into a training set and a test set and save them separately.
3. Apply the chi-square test to the segmented training set to extract a specified number of features, which serve as the dimensions of the generated word vectors (the word vectors are the input to the SVM model).
4. Train an SVM classifier on the word vectors; we choose a soft-margin SVM, i.e. SVC. The trained SVM model is saved to model.pkl so that it can be loaded directly for the sentence-level sentiment analysis that follows.
5. Evaluate the trained model on the test set: the predicted labels are compared with the original test-set labels to compute the evaluation metrics. This step is only needed to assess our model and can be skipped.
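The chi-square feature extraction in item 3 can be illustrated with the standard 2x2 contingency form of the statistic for a term and a class. The source does not say which formulation or library was used, so the pure-Python sketch below is an assumption: the toy corpus, the function names, and the binary bag-of-words vectors are all made up for illustration. Terms whose presence is independent of the label score 0, and terms that appear only in one class score highest.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/class contingency table.

    a: pos docs containing the term   b: neg docs containing the term
    c: pos docs without the term      d: neg docs without the term
    chi2 = N * (a*d - b*c)^2 / ((a+b) * (c+d) * (a+c) * (b+d))
    """
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def select_features(docs, labels, k):
    """Return the k highest-scoring terms and the full score table."""
    n_pos = sum(1 for y in labels if y == "pos")
    n_neg = len(labels) - n_pos
    pos_df, neg_df = Counter(), Counter()  # document frequency per class
    for tokens, y in zip(docs, labels):
        (pos_df if y == "pos" else neg_df).update(set(tokens))
    scores = {}
    for term in set(pos_df) | set(neg_df):
        a, b = pos_df[term], neg_df[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    # Sort by score (descending), then alphabetically for a stable result.
    ranked = sorted(scores, key=lambda t: (-scores[t], t))
    return ranked[:k], scores


def vectorize(tokens, features):
    """Binary bag-of-words vector over the selected features."""
    present = set(tokens)
    return [1 if f in present else 0 for f in features]


# Toy, already-segmented training data (made up for illustration).
DOCS = [["phone", "great"], ["great", "service"],
        ["phone", "bad"], ["bad", "battery"]]
LABELS = ["pos", "pos", "neg", "neg"]

if __name__ == "__main__":
    features, scores = select_features(DOCS, LABELS, 2)
    print(features, vectorize(["great", "phone"], features))
```

Vectors built this way would then be passed to a soft-margin SVM for training (item 4), for example scikit-learn's `SVC`, which is not shown here.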
Step 4: Build the search framework.
The search framework is built with Lucene, a full-text search engine architecture that is loosely coupled, efficient, and easy to extend through secondary development; its highly optimized inverted index structure gives it very high retrieval efficiency. Fig. 3 shows how Lucene implements full-text indexing. The process has two main parts, building the index and querying the index, which are described in turn below.
Building the index: 1. Obtain the original documents from the corpus we organized, including the comment text, the corresponding like count, and the sentiment orientation. 2. Create document objects. A document is a collection of fields that represent metadata related to the document; based on our needs and the data obtained, we create three fields, commentText, commentCounts, and sentiment, which are stored and indexed separately as different fields of the document. 3. Analyze the field content, that is, split the text of the documents we created into a series of individual atomic elements called tokens. A Chinese word segmenter is needed here; after comparing several segmenters and weighing both segmentation speed and accuracy, we chose mmseg4j, which offers the Simple and Complex segmentation methods, both based on the forward maximum matching algorithm. 4. Create the index. Lucene indexes tokens and uses an inverted index data structure to find documents by word, which is more concise and faster than traditional indexing methods.
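To make the indexing steps concrete, the following toy sketch mimics them in plain Python rather than in Lucene. It segments text with a simple forward maximum matching routine (a heavily simplified version of the idea behind the segmenter mentioned above) and builds an inverted index from tokens to document IDs. The `VOCAB` dictionary, the sample comments, and all function names are assumptions for illustration; only the field names commentText, commentCounts, and sentiment come from the description.

```python
from collections import defaultdict

# Hypothetical segmentation dictionary.
VOCAB = {"手机", "电池", "续航", "很好", "很差", "拍照"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)


def forward_max_match(text):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or piece in VOCAB:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_index(documents):
    """Map each token of commentText to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, doc in enumerate(documents):
        for token in forward_max_match(doc["commentText"]):
            index[token].add(doc_id)
    return index


# Sample documents with the three fields named in the description.
DOCUMENTS = [
    {"commentText": "手机拍照很好", "commentCounts": 120, "sentiment": "pos"},
    {"commentText": "电池续航很差", "commentCounts": 45, "sentiment": "neg"},
    {"commentText": "手机电池很差", "commentCounts": 300, "sentiment": "neg"},
]
INDEX = build_index(DOCUMENTS)

if __name__ == "__main__":
    print(forward_max_match("手机拍照很好"))
    print(sorted(INDEX["手机"]))
```

Looking up a word in `INDEX` returns the matching document IDs directly, which is the property that makes an inverted index fast at query time.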
Querying the index: 1. Query interface. The user needs to be given an interface for searching the index, through which the system receives the input, matches sentences, and displays the results. 2. Build the query. After the user enters keywords and before the search is executed, a Query object is created; the Query object can specify the document Field to search, the query keywords, and so on, and it produces the concrete query syntax. Because QueryParser supports flexible combinations, including Boolean logic expressions and fuzzy matching, we use QueryParser to parse the query expression. 3. Execute the query. We use Lucene's near-real-time search, which can search content that the IndexWriter has not yet committed. Near-real-time search lets content in the system be indexed and searched quickly and reduces the overhead incurred when the system commits index operations. Lucene implements near-real-time search through the NRTManager class: calling NRTManager's maybeReopen method directly yields the latest IndexSearcher object and thus the latest index, and the specified number of documents (i.e. sentences) is returned. To make selection easier for the user, we rank the returned documents by the like count of each comment sentence, so the documents with the most likes are shown to the user first.
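The query side can be sketched in the same toy style. This is not Lucene's QueryParser or NRTManager API; it is a plain-Python illustration, with made-up data and hypothetical function names, of the behaviour described: every keyword must match (a Boolean AND), the selected sentiment filters the hits, and results are ordered by like count so that the most-liked sentences come first.

```python
# Hypothetical pre-tokenized corpus; field names follow the description.
CORPUS = [
    {"tokens": {"phone", "camera", "great"}, "commentText": "phone camera great",
     "commentCounts": 120, "sentiment": "pos"},
    {"tokens": {"battery", "life", "poor"}, "commentText": "battery life poor",
     "commentCounts": 45, "sentiment": "neg"},
    {"tokens": {"phone", "battery", "poor"}, "commentText": "phone battery poor",
     "commentCounts": 300, "sentiment": "neg"},
    {"tokens": {"phone", "screen", "poor"}, "commentText": "phone screen poor",
     "commentCounts": 80, "sentiment": "neg"},
]


def build_inverted_index(corpus):
    """Map each token to the set of IDs of the documents that contain it."""
    index = {}
    for doc_id, doc in enumerate(corpus):
        for token in doc["tokens"]:
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, corpus, keywords, sentiment, top_n=10):
    """AND-match the keywords, filter by sentiment, rank by likes (descending)."""
    if not keywords:
        return []
    postings = [index.get(word, set()) for word in keywords]
    hits = set.intersection(*postings)
    matched = [corpus[i] for i in hits if corpus[i]["sentiment"] == sentiment]
    matched.sort(key=lambda d: d["commentCounts"], reverse=True)
    return [d["commentText"] for d in matched[:top_n]]


if __name__ == "__main__":
    idx = build_inverted_index(CORPUS)
    for sentence in search(idx, CORPUS, ["phone"], "neg"):
        print(sentence)
```

Sorting by like count at query time keeps the index itself simple; a production system could instead store the count as a sortable field and let the search engine do the ordering.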
Step 5: Obtain matching sentences based on keywords and sentiment information.
In the query interface provided by the search framework, the user enters keywords for the topic or central idea of the sentence to be generated, and we also give the user an option for selecting the sentiment orientation. Based on the keywords provided and the sentiment orientation selected, the back end matches the online comment sentences that best meet the user's needs and presents them in order of like count, and the user can pick from them the sentence that best reflects their own view.
Claims (3)
1. An automatic generation system for online comment sentences with a sentiment orientation, characterized in that it comprises online comment crawling, data cleaning and storage, sentiment analysis, search framework construction, and sentence matching and generation:
The online comment crawling establishes the data reserve for the system; the crawled online comment sentences serve as the original corpus;
The data cleaning and storage cleans the data of the original corpus: it filters out invalid and non-text information, converts useful information into text format, deletes duplicate information, and stores the processed data in the database for later use;
The sentiment analysis analyzes the sentiment orientation of the sentences in the corpus and writes the results to the database;
The search framework construction builds the search framework and creates a full-text index over the data in the database;
The sentence matching and generation executes queries and returns results: after the full-text index has been built, the query interface receives the user's input and selection, matches the corresponding text according to the user's input and the selected sentiment orientation, and returns it to the user.
2. The automatic generation system for online comment sentences with a sentiment orientation according to claim 1, characterized in that the system is user-oriented and user-driven, and generates sentences that meet the user's needs based on the specific information the user provides.
3. The automatic generation system for online comment sentences with a sentiment orientation according to claim 1, characterized in that it dispenses with the semantic and grammatical knowledge base required by traditional rule-based/template-based sentence generation, is built on principles that are easy to understand, and is simple and efficient to use.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910191363.0A CN109948031A (en) | 2019-03-12 | 2019-03-12 | On-Line review sentence automatic creation system with Sentiment orientation |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910191363.0A CN109948031A (en) | 2019-03-12 | 2019-03-12 | On-Line review sentence automatic creation system with Sentiment orientation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109948031A true CN109948031A (en) | 2019-06-28 |
Family
ID=67009908
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910191363.0A Pending CN109948031A (en) | 2019-03-12 | 2019-03-12 | On-Line review sentence automatic creation system with Sentiment orientation |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948031A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102236722A (en) * | 2011-08-17 | 2011-11-09 | 广州索答信息科技有限公司 | Method and system for generating user comment summaries based on triples |
CN103955451A (en) * | 2014-05-15 | 2014-07-30 | 北京优捷信达信息科技有限公司 | Method for judging emotional tendentiousness of short text |
US20150193426A1 (en) * | 2014-01-03 | 2015-07-09 | Yahoo! Inc. | Systems and methods for image processing |
CN107153641A (en) * | 2017-05-08 | 2017-09-12 | 北京百度网讯科技有限公司 | Comment information determines method, device, server and storage medium |
CN108228794A (en) * | 2017-12-29 | 2018-06-29 | 三角兽(北京)科技有限公司 | Apparatus for management of information, information processing unit and automatically reply/comment method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190628 |