CN111783426A - Long text emotion calculation method based on a double-topic method - Google Patents

Long text emotion calculation method based on a double-topic method

Info

Publication number
CN111783426A
Authority
CN
China
Prior art keywords
sentence
sentences
words
title
topic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010613202.9A
Other languages
Chinese (zh)
Inventor
程南昌
邹煜
杨柳
宋康
滕永林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Communication University of China
Original Assignee
Communication University of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Communication University of China
Priority to CN202010613202.9A
Publication of CN111783426A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/216 - Parsing using statistical methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/237 - Lexical tools
    • G06F40/242 - Dictionaries
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/284 - Lexical analysis, e.g. tokenisation or collocates

Abstract

The invention discloses a long text emotion calculation method based on a double-topic method. According to the statistics underlying the method, the proportion of texts whose overall tendency can be judged from the title alone reaches 0.61, and completing the text emotion calculation by combining the title with the topic (emotion) sentences therefore improves the accuracy of text emotion calculation to the greatest extent. The accuracy of the method reaches 0.82 in the analysis of long texts in the language-and-writing field.

Description

Long text emotion calculation method based on a double-topic method
Technical Field
The invention relates to the technical field of long text emotion calculation, and in particular to a long text emotion calculation method based on a double-topic method.
Background
Text tendency analysis has wide applications in electronic commerce, business intelligence, information monitoring, public-opinion surveys, e-learning, newspaper editing, enterprise management, and other areas. To promote the development of text tendency analysis technology, related evaluation activities have been carried out both internationally and domestically, mainly the TREC Blog Track [Iadh Ounis et al., 2006; Craig Macdonald et al., 2007] and NTCIR [Yohei Seki, 2008] internationally. In China, the Chinese Information Processing Society has held the Chinese Opinion Analysis Evaluation (COAE) several times since 2008, and the China Computer Federation also held a Chinese microblog sentiment analysis evaluation in August 2012.
Tendency analysis of long texts is difficult, and most current research on emotional tendency concentrates on the sentence or short-text level. One line of work obtains the emotion key sentences of an English text by weighting three aspects, emotion words, position, and keywords (such as "overall", "all in all", "in my opinion"), ranking the sentences by score, and finally building a document-level tendency classifier based on the emotion key sentences with a Bayesian algorithm; the experimental result is 2.84% better than similar foreign classifiers. You Jianqing obtained an accuracy of 0.71 in the emotion calculation of 150 news documents through weighting multi-dimensional features such as high-frequency words and position. Some research now adopts deep learning methods, but it mainly focuses on the sentence or word level, and work on whole documents is still at an exploratory stage. For example, Giatsoglou M et al. use a word vector model, which has certain advantages in analyzing word context but ignores the emotion of the whole text.
The research object of the invention is a language-and-writing public-opinion article with a title and a complete discourse structure, generally about 1,000 characters long, which is here called a long text; texts without titles and with loose organization, such as web comments and short microblogs (140 characters), are called short texts. Tendency analysis of long texts is difficult because one article often contains different topics, different topics may carry different emotions, and how to summarize the emotion of the whole document from these diverse emotions is a problem worth studying.
Disclosure of Invention
The invention aims to provide a long text sentiment calculation method based on a double-topic method. Sentiment tendency analysis of long texts (whole documents) is relatively difficult; the method extracts from an article the 2 topic key sentences that best represent the sentiment tendency of the whole text, through weighted calculation of 5 dimensional features: position, feature words, topic relevance, sentiment words, and potential topic sentences.
In order to solve the technical problem, the invention provides a long text emotion calculation method based on a double-topic method, which comprises the following steps:
step 1: title analysis: counting, from corpus 1, data on two aspects: the number of texts whose overall tendency can be judged from the title, and the number of texts whose overall theme is reflected in the title;
step 2: topic sentence recognition, which specifically comprises:
step 201: inputting a long text, where a long text refers to a text whose body length is generally in the range of 900-1,100 characters;
step 202: after word segmentation labeling and topic-sentence position labeling, topic relevance is calculated; the word segmentation labeling includes feature-word labeling, and the feature words are, respectively: single-character cue words, two-character cue words, suggestive words, classical-style cue components, fixed structures, words expressing summarization, and words expressing logical order; through analysis of topic-sentence feature words, 156 of them are extracted and stored in a topic-sentence marking word list; a sentence containing a feature word scores 0.7, 0.1 is added for each additional feature word, and the highest score is 1; the position features of topic sentences are ranked by sentence importance, with scores arranged from high to low as follows: last-paragraph first sentence 1, last-paragraph last sentence 0.95, second-paragraph first sentence 0.9, second-paragraph last sentence 0.85, first-paragraph first sentence 0.8, first-paragraph last sentence 0.75, other-paragraph first sentence 0.7, other-paragraph last sentence 0.65, last-paragraph other sentences 0.6, second-paragraph other sentences 0.55, first-paragraph other sentences 0.5, and other-paragraph other sentences 0.45; for the topic relevance calculation, Sim(Title, Sen) denotes the similarity between the title and a candidate topic sentence, WN (word number) the number of words in the title, (1/WN) the score of each identical word, and SWN the number of identical words; CN (character number) denotes the number of characters in the title, (1/CN) the score of each identical character, and SCN the number of identical characters; the title-sentence similarity calculation formula is as follows:
(The similarity formula and its simplified form appear as images in the original publication; see the reconstruction sketched below.)
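The formulas are not reproduced as text in this version. A minimal reconstruction from the definitions in step 202, assuming the identical-word ratio and the identical-character ratio are simply averaged so that the similarity stays within the per-dimension maximum of 1, would be:

```latex
% Hedged reconstruction; the published formula is only available as an image.
\mathrm{Sim}(Title, Sen) = \frac{1}{2}\left(\frac{1}{WN}\cdot SWN + \frac{1}{CN}\cdot SCN\right)
% Simplified form:
\mathrm{Sim}(Title, Sen) = \frac{1}{2}\left(\frac{SWN}{WN} + \frac{SCN}{CN}\right)
```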
step 203: emotion labeling and sentence emotion value calculation: the basic emotion dictionary of the CUCsas system is used to automatically segment and emotion-label the input sentences; a sentence containing emotion words scores 0.7, and 0.1 is added for each additional emotion word, up to a maximum of 1;
step 204: calculating and outputting the sentence emotion value; this step also involves judging whether the sentence is a potential topic sentence, and every potential topic sentence that passes the check scores 1; the five factors position, topic relevance, emotion words, feature words, and potential topic sentence are abbreviated LO, TR, EW, FW, and PT, respectively; the topic sentence weighting formula is as follows:
TopSenScore=(FW+Sim(Title,Sen)+EW)*2+LO*3+PT
and step 3: calculating the polarity of the long text: two topic sentences are extracted from the body of the text and, together with the title, form 3 sentences; the emotional tendency polarity of each of the three sentences is calculated as positive, negative, or neutral, and three cases arise: if all 3 sentences have exactly the same emotional tendency, any one of them represents the tendency of the article; if the 3 sentences have completely different tendencies, the tendency of the title is taken; if two sentences have the same tendency, that shared tendency is taken.
Compared with the prior art, the text tendency is calculated by combining the title and the topic sentences, and the main difficulty lies in the recognition precision of the topic sentences. According to the statistics underlying the method, the proportion of texts whose overall tendency can be judged from the title reaches 0.61, and completing the text emotion calculation by combining the title with the emotion (topic) sentences improves the accuracy of the calculation to the greatest extent. The accuracy of the method reaches 0.82 in the analysis of long texts in the language-and-writing field. In view of the difficulty of long-text analysis, the invention holds that, in a practical system, specific problems should be analyzed specifically and tackled one by one according to the text characteristics of particular fields.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a long text emotion calculation method based on a double-topic method, which comprises the following steps:
step 1: title analysis: data on two aspects are counted from corpus 1: the number of texts whose overall tendency can be judged from the title, and the number of texts whose overall theme is reflected in the title.
Step 2: and (3) identifying a subject sentence:
step 201: inputting a long text, where a long text refers to a text whose body length is generally in the range of 900-1,100 characters;
step 202: after word segmentation labeling and topic-sentence position labeling, topic relevance is calculated. The word segmentation labeling includes feature-word labeling; the feature words are, respectively: single-character cue words, two-character cue words, suggestive words, classical-style cue components, fixed structures, words expressing summarization, and words expressing logical order. Through analysis of topic-sentence feature words, 156 of them are extracted and stored in a topic-sentence marking word list. A sentence containing a feature word scores 0.7, 0.1 is added for each additional feature word, and the highest score is 1. The position features of topic sentences are ranked by sentence importance, with scores arranged from high to low as follows: last-paragraph first sentence (1), last-paragraph last sentence (0.95), second-paragraph first sentence (0.9), second-paragraph last sentence (0.85), first-paragraph first sentence (0.8), first-paragraph last sentence (0.75), other-paragraph first sentence (0.7), other-paragraph last sentence (0.65), last-paragraph other sentences (0.6), second-paragraph other sentences (0.55), first-paragraph other sentences (0.5), and other-paragraph other sentences (0.45). The topic relevance is then calculated:
Sim(Title, Sen) denotes the similarity between the title and a candidate topic sentence; WN (word number) is the number of words in the title, (1/WN) the score of each identical word, and SWN the number of identical words; CN (character number) is the number of characters in the title, (1/CN) the score of each identical character, and SCN the number of identical characters. The title-sentence similarity calculation formula is as follows:
(The similarity formula and its simplified form appear as images in the original publication.)
step 203: emotion labeling and sentence emotion value calculation: the basic emotion dictionary of the CUCsas system is used to automatically segment and emotion-label the input sentences. A sentence containing emotion words scores 0.7, and 0.1 is added for each additional emotion word, up to a maximum of 1.
Step 204: calculating and outputting the sentence emotion value. This step also involves judging whether the sentence is a potential topic sentence; every potential topic sentence that passes the check scores 1. The five factors position, topic relevance, emotion words, feature words, and potential topic sentence are abbreviated LO, TR, EW, FW, and PT, respectively. The topic sentence weighting formula is as follows:
TopSenScore=(FW+Sim(Title,Sen)+EW)*2+LO*3+PT
and step 3: calculating the polarity of the long text: two topic sentences are extracted from the body of the text and, together with the title, form 3 sentences; the emotional tendency polarity (positive, negative, or neutral) of each of the three sentences is calculated, and three cases arise: if all 3 sentences have exactly the same emotional tendency, any one of them represents the tendency of the article; if the 3 sentences have completely different tendencies, the tendency of the title is taken; if two sentences have the same tendency, that shared tendency is taken. A sketch of this decision rule is given below.
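As an illustration of the decision rule in step 3, the following minimal sketch (the function and variable names are illustrative, not taken from the patent) combines the polarities of the title and the two topic sentences:

```python
def combine_polarities(title_pol: str, topic1_pol: str, topic2_pol: str) -> str:
    """Combine three sentence polarities ('positive', 'negative', 'neutral')
    into a document polarity, following the three cases of step 3."""
    polarities = [title_pol, topic1_pol, topic2_pol]
    distinct = set(polarities)
    if len(distinct) == 1:      # all three agree: take any of them
        return title_pol
    if len(distinct) == 3:      # all three differ: fall back to the title
        return title_pol
    # exactly two agree: take the shared polarity
    for polarity in distinct:
        if polarities.count(polarity) == 2:
            return polarity


# Example: title negative, topic sentences negative and neutral -> 'negative'
print(combine_polarities("negative", "negative", "neutral"))
```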
The step 1 specifically comprises the following steps:
1 Main concept
The title is an embodiment of the theme of an article and plays an important role in judging the tendency of the whole text, but judging the tendency of a text from the title alone is clearly not enough. On the one hand, some titles objectively do not show a clear tendency of opinion; on the other hand, in terms of reading comprehension, a title can guide the reader toward the opinion of the article to some extent, but the reader finally grasps the opinion of the text by reading its content. Through feature analysis of language-and-writing public-opinion texts, the invention finds that not all sentences contribute to the tendency of an article; the polarity of the whole article is determined by one or several key sentences. The expression characteristics of such sentences are linked to the structural characteristics of the text; for example, they often appear at the beginning or end of an article or paragraph. If these expression characteristics can be captured and effectively recognized, tendency analysis of a whole document can be reduced to the sentence level, and the tendency of the text can be judged through weighted calculation over sentences.
2 title feature analysis
Although a title is short, it is the "eye" of the article and can often reflect the theme of the whole text and the emotional attitude of the author. From corpus 1, the invention counts data on two aspects: the number of texts whose overall tendency can be judged from the title, and the number of texts whose overall theme is reflected in the title. Table 1 below surveys the tendency and theme characteristics of titles:
(Table 1 is given as an image in the original publication.)
As can be seen from Table 1, titles that reflect the theme of the article account for 92%, and titles from which the overall tendency of the text can be judged account for 61%, which is enough to show the importance of the title.
Step 2:
3 subject sentence recognition method
A topic sentence is a sentence that contains the topic concept of the text; it is an important carrier of the central idea of the text and a concentrated embodiment of its content. The method identifies topic sentences through weighted calculation over 5 dimensions: feature words, position, emotion words, topic relevance, and potential topic sentences, with an initial maximum score of 1 for each dimension.
3.1 characteristic words
Marked topic sentences were extracted from corpus 1, and seven kinds of topic-sentence feature words were induced, namely: single-character cue words, two-character cue words, suggestive words, classical-style cue components, fixed structures, words expressing summarization, and words expressing logical order.
(1) Single-character cue words
Single-character cue words that often appear in topic sentences mainly include "say", "claim", "think", and "see". They usually co-occur with the subject of the evaluation, which is generally a person's name or an organization name representing a certain point of view, and are often followed by a comma. For example: the article says that a small dictionary fully reflects the changes of the times in China, and that the social and economic changes of China are reflected in this small dictionary.
(2) Two-character cue words
Two-character cue words that commonly appear in topic sentences mainly include "think", "state", "feel", "point out", "suggest", and "call for". They generally appear together with the subject and object of the evaluation and are often followed by commas, double quotation marks, colons, and the like. For example: experts point out that existing Chinese-language textbooks have four deficiencies, namely a lack of classics, a lack of the child's perspective, a lack of delight, and a lack of reality.
(3) Suggestive vocabulary
Suggestive words that often appear in topic sentences mainly include "in fact", "clearly", "for this", "unquestionably", "as that saying goes", and the like, and are often followed by a comma. For example: in fact, the creativity and vitality displayed by network hot words have already proven their worth.
(4) Classical-style (wenyan) cue components
In the field of language-and-writing public opinion, the cue component of some topic sentences is a classical-style (wenyan) expression, which is quite distinctive, for example expressions equivalent to "needless to say" or "no doubt".
(5) Fixed structures
Some fixed structures are also commonly used in topic sentences, mainly patterns such as "from/regarding/according to/in ... speaking/seeing", "as for ...", "looking at ...", "if ...", "just as ...", and so on. For example: for Taboada, it is very difficult for Argentine entrepreneurs to do business in China without understanding Confucius and the cultural traditions of China.
(6) Words expressing summarization
The invention collects common words with a summarizing meaning, mainly including "in summary", "overall", "all in all", and the like. For example: in a word, the quality of Chery cars is still reassuring.
(7) Words representing a logical order
To express their views in a more orderly way, some language-and-writing public-opinion articles use ordinal words or other expressions indicating levels when presenting a logical order, and such sentences are usually topic sentences.
Finally, through this analysis, 156 topic-sentence feature words were extracted and stored in a topic-sentence marking word list. A sentence containing a feature word scores 0.7, and 0.1 is added for each additional feature word, up to a maximum of 1. A minimal sketch of this scoring rule is given below.
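A minimal sketch of the feature-word scoring rule, under the assumption that segmentation has already been done; the small word set below is only a stand-in for the 156-entry topic-sentence marking word list, which is not reproduced in this text:

```python
# Stand-in for the patent's 156-entry topic-sentence marking word list.
FEATURE_WORDS = {"say", "point out", "in fact", "in summary", "all in all"}


def feature_word_score(tokens: list[str]) -> float:
    """0 if the sentence has no feature word; otherwise 0.7 for the first
    feature word plus 0.1 for each additional one, capped at 1.0."""
    hits = sum(1 for token in tokens if token in FEATURE_WORDS)
    if hits == 0:
        return 0.0
    return min(0.7 + 0.1 * (hits - 1), 1.0)
```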
3.2 topical sentence position features
In addition to feature words, topic sentences have obvious positional features, which can also provide important clues for their recognition. All topic sentences, 1,701 in total, were extracted from corpus 1; the results of the statistical analysis are as follows.
(1) Topic sentences were counted at the first-sentence, last-sentence, and other positions of each paragraph; the results are shown in Table 2.
TABLE 2 distribution of subject sentences in the first, last, and other positions of each paragraph
Position in paragraph      First sentence   Last sentence   Other positions   Total
Number of topic sentences  725              512             464               1701
Table 2 shows that topic sentences appear as the first sentence of a paragraph in 725 cases and as the last sentence in 512 cases, while the 464 cases at other positions account for only 27% of the total. It can be seen that topic sentences are mainly distributed at the first and last sentences of paragraphs, with first sentences outnumbering last sentences.
(2) The distribution of the subject sentences in the first paragraph, the second paragraph, the last paragraph and other paragraphs is shown in table 3.
TABLE 3 distribution of subject sentences in paragraphs
Paragraph                  First   Second   Last   Other   Total
Number of topic sentences  121     196      492    892     1701
Proportion                 7%      12%      29%    52%     100%
Table 3 shows that topic sentences appear in the first paragraph in 121 cases (only 7%), in the second paragraph in 196 cases (12%), in the last paragraph in 492 cases (29%), and in other paragraphs in 892 cases (52% of the total). It can be seen that, although topic sentences are spread fairly evenly across the paragraphs of the text, more of them fall in the last paragraph, reaching 29%; next comes the second paragraph with 12%; and the first paragraph also accounts for 7%.
According to these statistics, sentences are divided into three categories, first sentences, last sentences, and other sentences, and paragraphs into four categories, first, second, last, and other. According to the importance of the sentences, the scores are arranged from high to low as follows: last-paragraph first sentence (1), last-paragraph last sentence (0.95), second-paragraph first sentence (0.9), second-paragraph last sentence (0.85), first-paragraph first sentence (0.8), first-paragraph last sentence (0.75), other-paragraph first sentence (0.7), other-paragraph last sentence (0.65), last-paragraph other sentences (0.6), second-paragraph other sentences (0.55), first-paragraph other sentences (0.5), and other-paragraph other sentences (0.45). These scores can be read as the lookup table sketched below.
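The position scores listed above can be implemented as a lookup table keyed by paragraph category and sentence category; a sketch, with category labels of our own choosing:

```python
# (paragraph category, sentence category) -> position score, as listed in section 3.2.
POSITION_SCORE = {
    ("last", "first"): 1.00, ("last", "last"): 0.95,
    ("second", "first"): 0.90, ("second", "last"): 0.85,
    ("first", "first"): 0.80, ("first", "last"): 0.75,
    ("other", "first"): 0.70, ("other", "last"): 0.65,
    ("last", "other"): 0.60, ("second", "other"): 0.55,
    ("first", "other"): 0.50, ("other", "other"): 0.45,
}


def position_score(paragraph: str, sentence: str) -> float:
    """paragraph in {'first', 'second', 'last', 'other'},
    sentence in {'first', 'last', 'other'}."""
    return POSITION_SCORE[(paragraph, sentence)]
```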
3.3 topic relevance computation
According to the statistics of the foregoing, 92% of titles can reflect article topics, and based on this, the present invention calculates the relevance of a sentence to a topic by the similarity of the sentence to the title. Two articles are described below as examples.
Example 1: Title: Public buses trial Suzhou-dialect stop announcements; passengers and netizens find it quite "cordial" [28]
Topic sentence: Once aired, this caused much discussion among netizens online; most Suzhou passengers said that hearing this sound "really feels quite cordial".
Example 2: Title: A look at the "Suzhou-dialect stop announcements" on the Suzhou metro [44]
Topic sentence: On the whole, as someone who came to Suzhou from elsewhere, I can fully understand the local feelings of Suzhou people; announcing stops in Suzhou dialect makes it convenient for local elderly people who are unfamiliar with Mandarin to ride the subway.
In Example 1, the words that are similar between the topic sentence and the title are "netizens", "Suzhou", and "passengers", and the identical characters are "net" (网) and "feel" (感); in Example 2, the similar words are "elsewhere", "Suzhou", "use", "Suzhou dialect", and "subway", and the identical characters are "Su" (苏), "announce" (报), and "stop" (站). "Stop announcement" (报站) is a separate segmentation unit, so if the comparison were made at the word level only, the title and the topic sentence would share just two identical words.
Since the title has its own unique features, it differs from the sentences of the body; the purpose of the similarity comparison is to examine whether they talk about the same topic, not whether they are semantically identical. On this basis, the advantages of two models, simple common-word matching and minimum edit distance, are considered together and improved to some extent. Simple common-word matching works well when words are identical, but when two words are only partially identical the minimum edit distance plays a role: for "Suzhou" (苏州) and "Su" (苏) above, the similarity under the simple common-word model is 0, while under minimum edit distance (a distance of 1) the similarity is higher.
Therefore, when calculating similarity, not only the number of identical words but also the number of identical characters is considered. The invention uses: Sim(Title, Sen) for the similarity between the title and a candidate topic sentence; WN (word number) for the number of words in the title, (1/WN) for the score of each identical word, and SWN for the number of identical words; CN (character number) for the number of characters in the title, (1/CN) for the score of each identical character, and SCN for the number of identical characters. The title-sentence similarity calculation formula is as follows:
(The similarity formula and its simplified form appear as images in the original publication; a sketch of the computation follows.)
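Under the same assumption as the reconstruction given with step 202 (the identical-word ratio and the identical-character ratio are averaged), the similarity computation can be sketched as follows; the segmentation itself is left outside the sketch, since the patent relies on the CUCsas segmenter:

```python
def title_sentence_similarity(title_words: list[str], sen_words: list[str]) -> float:
    """Hedged sketch of Sim(Title, Sen): the average of the identical-word ratio
    and the identical-character ratio, both normalized by the title length."""
    wn = len(title_words)                                  # WN: number of words in the title
    swn = len(set(title_words) & set(sen_words))           # SWN: identical words
    title_chars = "".join(title_words)
    cn = len(title_chars)                                  # CN: number of characters in the title
    sen_chars = set("".join(sen_words))
    scn = sum(1 for ch in set(title_chars) if ch in sen_chars)  # SCN: identical characters
    if wn == 0 or cn == 0:
        return 0.0
    return 0.5 * (swn / wn + scn / cn)


# Partial matches such as "苏州" (Suzhou) vs "苏" (Su) are caught at the character level.
print(title_sentence_similarity(["苏州", "地铁", "报站"], ["外地", "来", "苏", "坐", "地铁", "报站"]))
```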
3.4 whether the sentence contains emotion words
The invention aims to judge the emotional tendency of a text, so whether a sentence contains emotion words is an important feature. The basic emotion dictionary of the CUCsas (sentiment analysis system) tendency-analysis system contains more than 27,000 entries, and input sentences can be automatically segmented and emotion-labeled. For example: we need to fully attach importance to and utilize these unique resource advantages to build a good folk-culture ecological environment. After word segmentation and emotion labeling:
we/r need to/v fully/a attach importance to/v and/c utilize/v these/r uniquely endowed/i/po 的/u resource advantages/n/po ,/w build/v good/a/po 的/u folk culture/n ecological environment/n ./w
Wherein ne denotes negative and po denotes positive; with emotion labels it is easy to judge whether a sentence contains emotion words. A sentence containing emotion words scores 0.7, and 0.1 is added for each additional emotion word, up to a maximum of 1. A sketch of this scoring is given below.
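A sketch of the emotion-word score computed from CUCsas-style output, assuming (as in the labeled example above) that emotion words are the tokens carrying a /po or /ne tag; the tag format is inferred from that example:

```python
def emotion_word_score(tagged_tokens: list[str]) -> float:
    """tagged_tokens look like 'resource advantages/n/po'; emotion words carry /po or /ne.
    0 if none is present; otherwise 0.7 plus 0.1 per additional emotion word, capped at 1.0."""
    hits = sum(1 for token in tagged_tokens if "/po" in token or "/ne" in token)
    if hits == 0:
        return 0.0
    return min(0.7 + 0.1 * (hits - 1), 1.0)


# The labeled example above contains three /po tokens, giving a score of 0.9.
```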
3.5 whether the sentence is a potential topic sentence
Statistics in this work show that the average sentence length of a long text is around 39 words. If a sentence is overly long or short, or contains words such as "statement" or "comment", it is generally unrelated to the topic and is given no score for this dimension; all other sentences are regarded as potential topic sentences. Every potential topic sentence that passes this check scores 1. A sketch of this check is given below.
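The potential-topic-sentence check can be sketched as follows; the exact length bounds are not given in the patent (only the 39-word average), so the thresholds and the small disqualifying word set below are illustrative assumptions:

```python
# Words such as "statement" or "comment" disqualify a sentence (section 3.5).
DISQUALIFYING_WORDS = {"statement", "comment"}


def potential_topic_score(tokens: list[str], min_len: int = 10, max_len: int = 80) -> int:
    """Return 1 if the sentence counts as a potential topic sentence, else 0.
    min_len/max_len are illustrative bounds around the reported 39-word average."""
    if not min_len <= len(tokens) <= max_len:
        return 0
    if any(token in DISQUALIFYING_WORDS for token in tokens):
        return 0
    return 1
```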
3.6 weighted calculation
The invention sets the maximum score of each sentence at 10 and uses 5 dimensional features, called weighting factors. For convenience of discussion, the five factors "position, topic relevance, emotion words, feature words, and whether the sentence is a potential topic sentence" are abbreviated "LO, TR, EW, FW, PT", respectively. A topic-sentence extraction experiment was carried out on the finely annotated corpus: each factor was first given an equal weight of 2 and all 5 factors were used to identify topic sentences by weighting; then one of the five factors was removed at a time, leaving 4 for testing. The results are shown in Table 4.
TABLE 4 Topic sentence identification effect with five equally weighted factors
(Table 4 is given as an image in the original publication.)
From the experimental results, the importance of the five factors, ranked from high to low, is: position, topic relevance, emotion words, feature words, and whether the sentence is a potential topic sentence. The weights were therefore readjusted; the weights of the five factors are shown in the table below.
TABLE 5 Optimized factor weights
Factor    Feature words   Topic relevance   Emotion words   Position   Potential topic sentence
Weight    2               2                 2               3          1
After the weights were re-optimized, the accuracy of topic-sentence recognition reached 0.49, close to 50%. The weighted topic-sentence score is computed by the following formula.
TopSenScore=(FW+Sim(Title,Sen)+EW)*2+LO*3+PT
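Putting the five factors together with the optimized weights (feature words, topic relevance, and emotion words weighted 2, position weighted 3, potential topic sentence weighted 1), the weighted score and the selection of the two topic sentences can be sketched as follows; the per-factor scores are assumed to come from routines like the sketches in the previous subsections:

```python
def top_sen_score(fw: float, sim: float, ew: float, lo: float, pt: int) -> float:
    """TopSenScore = (FW + Sim(Title, Sen) + EW) * 2 + LO * 3 + PT, at most 10."""
    return (fw + sim + ew) * 2 + lo * 3 + pt


def pick_topic_sentences(scored_sentences: list[tuple[str, float]]) -> list[str]:
    """Return the two sentences with the highest TopSenScore (the 'double topic' sentences)."""
    ranked = sorted(scored_sentences, key=lambda pair: pair[1], reverse=True)
    return [sentence for sentence, _ in ranked[:2]]
```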
4 evaluation of the topic sentence recognition method
To verify the effect of topic key-sentence recognition, the invention took part in Task 1 of COAE2014. The task is at the document level and is named "extraction and judgment of emotion key sentences in news": given a news collection (each article already split into sentences), the emotion key sentences of each article must be identified and their tendency polarity judged. The test corpus comes from web news texts in xml format, with titles but without paragraph marks; the body is split into sentences arranged in the order of the running text.
4.1 Algorithm Fine tuning
(1) Sentences are no longer grouped by first and last paragraph; instead, all sentences are divided into three parts, the first 3 sentences, the last 3 sentences, and the middle sentences, and different weights are assigned to each part;
(2) Feature words commonly used when opinions are expressed in news, such as "learned" and "introduced", are added, and the combined expression patterns commonly used for opinions in news, such as "person name + verb of expression (think, feel, say, etc.)", are given higher weight.
4.2 evaluation results and analysis
The participating institutions are essentially research institutes and laboratories with deep expertise in natural language processing in China, and they can basically represent the current overall level of Chinese natural language processing technology. Our system, casia-cuc, submitted three sets of results, the best of which is shown in the table below.
Table 6 below shows the evaluation performance of casia-cuc in COAE2014:
runtag PosR PosP PosF1 NegR NegP NegF1 Accuracy MicroR MicroP MicroF1
Baseline 0.125 0.028 0.046 0.047 0.031 0.037 0.029 0.241 0.065 0.102
casia-cuc 0.255 0.055 0.091 0.240 0.071 0.109 0.063 0.333 0.090 0.142
average 0.154 0.037 0.057 0.101 0.049 0.062 0.043 0.219 0.073 0.110
median 0.130 0.034 0.049 0.068 0.052 0.057 0.041 0.220 0.068 0.104
best 0.464 0.063 0.110 0.240 0.085 0.109 0.065 0.389 0.104 0.164
table 6 illustrates: baseline scores sentences by adopting a method of keyword accumulation scoring, and extracts emotion key sentences; and judging the emotional tendency of the sentence by adopting a naive Bayes classification method. The accuracy of the best performance in all the competition systems is only 0.065, and the micro-average is 0.164, so that the trend analysis at the chapter level is quite difficult. The reason is as follows: on one hand, sentences of long texts are basically complex sentences, and the tendency polarity judgment difficulty is quite large; on the other hand, an article has a plurality of subject sentences, standard answers only allow 2 of the subject sentences to be selected, the judgment of a computer and a person can be greatly different, and the judgment is only calculated if the emotion key sentences are correctly identified and the tendency judgment is correct. The evaluation results of the people have positive tendency of F1 value of No. 2, negative tendency of F1 value of No. 1, precision of No. 2 and micro average of No. 3, and although the results are good, the precision is not optimistic, and a great promotion space is provided.
5 sentence emotional tendency analysis
Once the topic sentences are identified, the next and more important step is the emotion calculation of the topic sentences and the title; its accuracy directly affects the tendency analysis of the whole text. At the sentence level the invention mainly adopts the CUCsas system, which has performed well in several COAE emotion evaluations: in tendency-strength analysis at the short-text level (five-point scoring) its accuracy reaches up to 83%, the best performance on short-text tendency polarity in COAE2012. The sentence-level calculation flow described above is shown in FIG. 1.
Step 3 comprises the following steps:
6 polarity calculation of the long text (i.e., the calculation of the emotional tendency of the long text in the invention)
The analysis and processing of the title and the topic sentences are the basis of document-level tendency calculation; according to the long-text tendency-analysis strategy, the tendency calculation is completed by combining the title and the topic sentences. The invention extracts only two topic sentences from the body of the text; together with the title they form 3 sentences, whose emotional tendency polarities (positive, negative, neutral) are calculated separately. Their tendency polarities fall into three cases: if all 3 sentences have exactly the same emotional tendency, any one of them represents the tendency of the article; if the 3 sentences have completely different tendencies, the tendency of the title is taken; if two sentences have the same tendency, that shared tendency is taken. The results of the test carried out on corpus 2 are shown in Table 7.
TABLE 7 text tendency test details
Tendency   Number of texts   Correctly judged   Accuracy
Positive   135               119                0.88
Negative   183               146                0.80
Neutral    7                 2                  0.29
Total      325               267                0.82
Table 7 above shows that the average accuracy reaches 0.82; the method has already been applied in practice in the national language-and-writing public-opinion database. To further test the accuracy of document-level tendency analysis, we ran the test on corpus 3; the results are shown in Table 8 below.
TABLE 8 text tendency test details
Tendency   Number of texts   Correctly judged   Accuracy
Positive   141               119                0.84
Negative   170               130                0.76
Neutral    71                26                 0.37
Total      382               275                0.72
Table 8 above shows that the average accuracy on this evaluation corpus is 0.72, 10 percentage points lower than on the corpus from the language-and-writing field, so the tendency analysis of long texts still has considerable room for improvement. The main reason may be that the evaluation corpus comes from different fields, and the characteristics of topic sentences may differ across fields.
Compared with the prior art, the text tendency is calculated by combining the title and the topic sentences, and the main difficulty lies in the recognition precision of the topic sentences. According to the statistics underlying the method, the proportion of texts whose overall tendency can be judged from the title reaches 0.61, and completing the text emotion calculation by combining the title with the emotion (topic) sentences improves the accuracy of the calculation to the greatest extent. The accuracy of the method reaches 0.82 in the analysis of long texts in the language-and-writing field. In view of the difficulty of long-text analysis, the invention holds that, in a practical system, specific problems should be analyzed specifically and tackled one by one according to the text characteristics of particular fields.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A long text emotion calculation method based on a double-topic method, characterized by comprising the following steps:
step 1: title analysis: counting, from corpus 1, data on two aspects: the number of texts whose overall tendency can be judged from the title, and the number of texts whose overall theme is reflected in the title;
step 2: topic sentence recognition, which specifically comprises:
step 201: inputting a long text, where a long text refers to a text whose body length is generally in the range of 900-1,100 characters;
step 202: after word segmentation labeling and topic-sentence position labeling, topic relevance is calculated; the word segmentation labeling includes feature-word labeling, and the feature words are, respectively: single-character cue words, two-character cue words, suggestive words, classical-style cue components, fixed structures, words expressing summarization, and words expressing logical order; through analysis of topic-sentence feature words, 156 of them are extracted and stored in a topic-sentence marking word list; a sentence containing a feature word scores 0.7, 0.1 is added for each additional feature word, and the highest score is 1; the position features of topic sentences are ranked by sentence importance, with scores arranged from high to low as follows: last-paragraph first sentence 1, last-paragraph last sentence 0.95, second-paragraph first sentence 0.9, second-paragraph last sentence 0.85, first-paragraph first sentence 0.8, first-paragraph last sentence 0.75, other-paragraph first sentence 0.7, other-paragraph last sentence 0.65, last-paragraph other sentences 0.6, second-paragraph other sentences 0.55, first-paragraph other sentences 0.5, and other-paragraph other sentences 0.45; for the topic relevance calculation, Sim(Title, Sen) denotes the similarity between the title and a candidate topic sentence, WN (word number) the number of words in the title, (1/WN) the score of each identical word, and SWN the number of identical words; CN (character number) denotes the number of characters in the title, (1/CN) the score of each identical character, and SCN the number of identical characters; the title-sentence similarity calculation formula is as follows:
(The similarity formula and its simplified form appear as images in the original publication.)
step 203: emotion labeling and sentence emotion value calculation: the basic emotion dictionary of the CUCsas system is used to automatically segment and emotion-label the input sentences; a sentence containing emotion words scores 0.7, and 0.1 is added for each additional emotion word, up to a maximum of 1;
step 204: calculating and outputting the sentence emotion value; this step also involves judging whether the sentence is a potential topic sentence, and every potential topic sentence that passes the check scores 1; the five factors position, topic relevance, emotion words, feature words, and potential topic sentence are abbreviated LO, TR, EW, FW, and PT, respectively; the topic sentence weighting formula is as follows:
TopSenScore=(FW+Sim(Title,Sen)+EW)*2+LO*3+PT
and step 3: calculating the polarity of the long text: two topic sentences are extracted from the body of the text and, together with the title, form 3 sentences; the emotional tendency polarity of each of the three sentences is calculated as positive, negative, or neutral, and three cases arise: if all 3 sentences have exactly the same emotional tendency, any one of them represents the tendency of the article; if the 3 sentences have completely different tendencies, the tendency of the title is taken; if two sentences have the same tendency, that shared tendency is taken.
CN202010613202.9A 2020-06-30 2020-06-30 Long text emotion calculation method based on double-question method Pending CN111783426A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010613202.9A CN111783426A (en) 2020-06-30 2020-06-30 Long text emotion calculation method based on double-question method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010613202.9A CN111783426A (en) 2020-06-30 2020-06-30 Long text emotion calculation method based on double-question method

Publications (1)

Publication Number Publication Date
CN111783426A true CN111783426A (en) 2020-10-16

Family

ID=72761231

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010613202.9A Pending CN111783426A (en) 2020-06-30 2020-06-30 Long text emotion calculation method based on double-question method

Country Status (1)

Country Link
CN (1) CN111783426A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076737A (en) * 2021-03-26 2021-07-06 三亚中科遥感研究所 Ecological environment perception network construction method fusing public emotion

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
NANCHANG CHENG et al.: "Chinese Long Text Sentiment Analysis Based on the Combination of Title and Topic Sentences", 2019 6th International Conference on Dependable Systems and Their Applications (DSA), 26 March 2020 (2020-03-26), pages 348-352 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076737A (en) * 2021-03-26 2021-07-06 三亚中科遥感研究所 Ecological environment perception network construction method fusing public emotion
CN113076737B (en) * 2021-03-26 2023-01-31 三亚中科遥感研究所 Method for constructing ecological environment perception network fusing public emotions

Similar Documents

Publication Publication Date Title
Ghosh et al. Fracking sarcasm using neural network
CN109543178B (en) Method and system for constructing judicial text label system
CN101599071B (en) Automatic extraction method of conversation text topic
US10133733B2 (en) Systems and methods for an autonomous avatar driver
CN109829166B (en) People and host customer opinion mining method based on character-level convolutional neural network
CN110263319A (en) A kind of scholar's viewpoint abstracting method based on web page text
CN110705247B (en) Based on x2-C text similarity calculation method
CN110674296B (en) Information abstract extraction method and system based on key words
CN111538828A (en) Text emotion analysis method and device, computer device and readable storage medium
Abainia DZDC12: a new multipurpose parallel Algerian Arabizi–French code-switched corpus
Jucker Corpus pragmatics
Tang et al. Evaluation of Chinese sentiment analysis APIs based on online reviews
Chaski Best practices and admissibility of forensic author identification
CN111310467B (en) Topic extraction method and system combining semantic inference in long text
Bertsch et al. Detection of puffery on the english wikipedia
CN111783426A (en) Long text emotion calculation method based on double-topic method
Simaki et al. Evaluating stance-annotated sentences from the Brexit Blog Corpus: A quantitative linguistic analysis
Sweeney et al. Multi-entity sentiment analysis using entity-level feature extraction and word embeddings approach.
CN110825824B (en) User relation portrait method based on semantic visual/non-visual user character representation
Alrahabi et al. Automatic annotation of direct reported speech in arabic and french, according to a semantic map of enunciative modalities
Iwatsuki et al. Communicative-function-based sentence classification for construction of an academic formulaic expression database
Arkhangelskiy Russian verbal borrowings in Udmurt
Marin et al. Detecting authority bids in online discussions
Yang et al. Recognizing sentiment polarity in Chinese reviews based on topic sentiment sentences
WO2019132648A1 (en) System and method for identifying concern evolution within temporal and geospatial windows

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination