CN111783426A - Long text emotion calculation method based on the double-topic method - Google Patents
Long text emotion calculation method based on the double-topic method
- Publication number
- CN111783426A (application number CN202010613202.9A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- sentences
- words
- title
- topic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING; G06F40/00—Handling natural language data; G06F40/20—Natural language analysis
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
- G06F40/216—Parsing using statistical methods
- G06F40/242—Dictionaries
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Abstract
The invention discloses a long text emotion calculation method based on the double-topic method. According to the statistics underlying the method, the proportion of texts whose whole-text orientation can be judged from the title alone reaches 0.61, so completing text emotion calculation by combining the title with emotion-bearing topic sentences improves the accuracy of text emotion calculation to the greatest extent. The accuracy of the method reaches 0.82 on long texts in the language-and-writing domain.
Description
Technical Field
The invention relates to the technical field of long text emotion calculation, and in particular to a long text emotion calculation method based on the double-topic method.
Background
Text orientation analysis is widely applied in electronic commerce, business intelligence, information monitoring, public opinion surveys, e-learning, newspaper editing, enterprise management, and other areas. To promote the development of text orientation analysis technology, related evaluation campaigns have been carried out both internationally and domestically: internationally, mainly the TREC Blog Track [Iadh Ounis et al., 2006; Craig Macdonald et al., 2007] and NTCIR [Yohei Seki, 2008]. In China, the Chinese Information Processing Society has held the Chinese Opinion Analysis Evaluation (COAE) many times since 2008, and the China Computer Federation held a Chinese microblog sentiment analysis evaluation in August 2012.
Orientation analysis of long texts is difficult, and most current research on emotional orientation concentrates on the sentence or short-text level. One line of work weights three features of English sentences — emotion words, position, and keywords (such as "overall", "all in all", "in my opinion") — and ranks sentences by score to obtain the emotional key sentences of a text; a chapter-level orientation classifier built over those key sentences with a Bayesian algorithm improved the experimental result by 2.84% over comparable foreign classifiers. You Jianqing obtained an accuracy of 0.71 in the emotion calculation of 150 news articles through weighting multi-dimensional features such as high-frequency words and position. Some research now adopts deep learning methods, but they mainly focus on the sentence or word level, and work on whole chapters is still at the exploratory stage. For example, Giatsoglou M et al. use a word vector model, which has certain advantages in analyzing word context but ignores the emotion of the text as a whole.
The research object of the invention is the language-and-writing chapter: a text with a title and a complete discourse structure, generally about 1000 characters long, which is here called a long text. Texts without titles and with arbitrary organization, such as web comments and short microblogs (140 characters), are called short texts. Orientation analysis of long texts is difficult because one article often contains different topics, different topics may carry different emotions, and how to summarize the chapter-level emotion from these diverse emotions is a problem worth studying.
Disclosure of Invention
The invention aims to provide a long text emotion calculation method based on the double-topic method. Emotional orientation analysis of long texts (chapters) is comparatively difficult; the method computes from an article, by weighted calculation over 5 dimensional features — position, feature words, topic relevance, emotion words, and potential topic sentences — the 2 topic key sentences that best represent the emotional orientation of the whole text.
In order to solve the above technical problem, the invention provides a long text emotion calculation method based on the double-topic method, comprising the following steps:
step 1: title analysis: from corpus 1, count two statistics: the number of texts whose whole-text orientation can be judged from the title, and the number of texts whose overall "theme" is reflected in the title;
step 2: topic sentence identification, specifically comprising:
step 201: inputting a long text, where a long text is a text whose length is generally in the range of 900-1100 characters;
step 202: after word segmentation labeling and topic sentence position feature labeling, topic relevance is calculated. Word segmentation labeling includes feature word labeling; the feature words are: single-character cue words, double-character cue words, suggestive vocabulary, classical-Chinese cue components, fixed structures, words expressing summarization, and words expressing logical order. Through analysis of these feature words, 156 topic sentence feature words are extracted and stored in a topic sentence marking word list. A sentence containing a feature word scores 0.7, with 0.1 added for each additional feature word, up to a maximum of 1. The topic sentence position features are scored by sentence importance, from high to low: last-paragraph first sentence 1, last-paragraph last sentence 0.95, second-paragraph first sentence 0.9, second-paragraph last sentence 0.85, first-paragraph first sentence 0.8, first-paragraph last sentence 0.75, other-paragraph first sentence 0.7, other-paragraph last sentence 0.65, last-paragraph other sentence 0.6, second-paragraph other sentence 0.55, first-paragraph other sentence 0.5, other-paragraph other sentence 0.45. For topic relevance, Sim(Title, Sen) denotes the similarity between the title and a candidate topic sentence; WN (word number) denotes the number of words in the title, (1/WN) the score of each shared word, and SWN the number of shared words; CN (character number) denotes the number of characters in the title, (1/CN) the score of each shared character, and SCN the number of shared characters. The title-sentence similarity formula is as follows:
the simplified formula is as follows:
step 203: emotion labeling and sentence emotion value calculation: the input sentence is automatically segmented and emotion-labeled using the basic emotion dictionary of the CUCsas system; a sentence containing an emotion word scores 0.7, with 0.1 added for each additional emotion word, up to a maximum of 1;
step 204: calculating and outputting the sentence emotion value. This includes judging whether the sentence is a potential topic sentence; every potential topic sentence that passes verification scores 1. The five factors — position, topic relevance, emotion words, feature words, and potential topic sentence — are abbreviated LO, TR, EW, FW, and PT respectively. The topic sentence weighting formula is as follows:
TopSenScore=(FW+Sim(Title,Sen)+EW)*2+LO*3+PT
and step 3: calculating the polarity of the long text: two topic sentences are extracted from the body and, together with the title, give 3 sentences; the emotional orientation polarity of each of the three sentences is calculated as positive, negative, or neutral. There are three cases: if all 3 sentences have exactly the same emotional orientation, the orientation of any one of them represents the article; if the 3 sentences have three different orientations, the title's orientation is taken; if two sentences share an orientation, that majority orientation is taken.
Compared with the prior art, the invention calculates text orientation by combining the title with topic sentences; the main difficulty lies in the recognition precision of the topic sentences. According to the statistics of the method, the proportion of texts whose whole-text orientation can be judged from the title reaches 0.61, and completing text emotion calculation by combining the title with emotion sentences improves the accuracy of text emotion calculation to the greatest extent. The accuracy of the method reaches 0.82 on long texts in the language-and-writing domain. In view of the difficulty of long text analysis, the invention holds that in a practical system, specific problems should be analyzed specifically and tackled one by one according to the text characteristics of specific domains.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The invention provides a long text emotion calculation method based on the double-topic method, comprising the following steps:
step 1: title analysis: from corpus 1, two statistics are counted: the number of texts whose whole-text orientation can be judged from the title, and the number of texts whose overall "theme" is reflected in the title.
Step 2: topic sentence identification:
step 201: inputting a long text, where a long text is a text whose length is generally in the range of 900-1100 characters;
step 202: after word segmentation labeling and topic sentence position feature labeling, topic relevance is calculated. Word segmentation labeling includes feature word labeling; the feature words are: single-character cue words, double-character cue words, suggestive vocabulary, classical-Chinese cue components, fixed structures, words expressing summarization, and words expressing logical order. Through analysis of these feature words, 156 topic sentence feature words are extracted and stored in a topic sentence marking word list. A sentence containing a feature word scores 0.7, with 0.1 added for each additional feature word, up to a maximum of 1. The topic sentence position features are scored by sentence importance, from high to low: last-paragraph first sentence (1), last-paragraph last sentence (0.95), second-paragraph first sentence (0.9), second-paragraph last sentence (0.85), first-paragraph first sentence (0.8), first-paragraph last sentence (0.75), other-paragraph first sentence (0.7), other-paragraph last sentence (0.65), last-paragraph other sentence (0.6), second-paragraph other sentence (0.55), first-paragraph other sentence (0.5), other-paragraph other sentence (0.45). Topic relevance is then calculated: Sim(Title, Sen) denotes the similarity between the title and a candidate topic sentence; WN (word number) denotes the number of words in the title, (1/WN) the score of each shared word, and SWN the number of shared words; CN (character number) denotes the number of characters in the title, (1/CN) the score of each shared character, and SCN the number of shared characters. The title-sentence similarity formula is as follows:
the simplified formula is as follows:
step 203: emotion labeling and sentence emotion value calculation: the input sentence is automatically segmented and emotion-labeled using the basic emotion dictionary of the CUCsas system; a sentence containing an emotion word scores 0.7, with 0.1 added for each additional emotion word, up to a maximum of 1.
Step 204: calculating and outputting the sentence emotion value. This includes judging whether the sentence is a potential topic sentence; every potential topic sentence that passes verification scores 1. The five factors — position, topic relevance, emotion words, feature words, and potential topic sentence — are abbreviated LO, TR, EW, FW, and PT respectively. The topic sentence weighting formula is as follows:
TopSenScore=(FW+Sim(Title,Sen)+EW)*2+LO*3+PT
Step 3: calculating the polarity of the long text: two topic sentences are extracted from the body and, together with the title, give 3 sentences; the emotional orientation polarity (positive, negative, or neutral) of each of the three sentences is calculated. There are three cases: if all 3 sentences have exactly the same emotional orientation, the orientation of any one of them represents the article; if the 3 sentences have three different orientations, the title's orientation is taken; if two sentences share an orientation, that majority orientation is taken.
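The three-case decision rule in step 3 amounts to a majority vote over the three polarities, with the title breaking a three-way tie. A minimal Python sketch (the function name and string labels are illustrative, not from the source):

```python
def text_polarity(title_pol: str, topic1_pol: str, topic2_pol: str) -> str:
    """Combine the polarities ("positive"/"negative"/"neutral") of the title
    and the two extracted topic sentences into a whole-text polarity:
    - all three agree: any of them represents the article;
    - all three differ: fall back to the title's polarity;
    - exactly two agree: take the majority polarity."""
    pols = [title_pol, topic1_pol, topic2_pol]
    if len(set(pols)) == 3:  # three different orientations: take the title
        return title_pol
    return max(set(pols), key=pols.count)  # unanimous or 2-vs-1 majority
```

For instance, `text_polarity("neutral", "positive", "positive")` returns `"positive"`, since two of the three sentences agree.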
The step 1 specifically comprises the following steps:
1 Main concept
The title embodies the theme of the article and plays an important role in judging the orientation of the whole text, but judging text orientation from the title alone is clearly insufficient. On one hand, some titles objectively do not show a clear orientation of viewpoint; on the other hand, from the reader's side, the title can guide the reader's understanding of the article's viewpoint to some extent, but the viewpoint of the text is ultimately grasped by reading the body. Through feature analysis of the corpus, the invention finds that not all sentences contribute to the orientation of an article: one or several key sentences can determine the polarity of the whole article. The expressive features of these sentences are connected with chapter structure — summarizing and concluding expressions, for example, often appear at the beginning and end of an article or paragraph. If such expressive features can be captured and effectively recognized, chapter-level orientation analysis can be reduced to the sentence level, and the orientation of the text judged by weighted calculation over sentences.
2 Title feature analysis
Although the title is short, it is the "eye" of the article and often reflects the theme of the whole text and the emotional mood of the author. From corpus 1, the invention counts two statistics: the number of texts whose whole-text orientation can be judged from the title, and the number of texts whose overall "theme" is reflected in the title. Table 1 below surveys the orientation and theme characteristics of titles:
as can be seen from Table 1, the title reflects the 92% of the theme of the article, and the title can judge that the text with the whole text tendency accounts for 61%, which is sufficient for the importance of the title.
Step 2:
3 Topic sentence recognition method
A topic sentence is a sentence containing the text's thematic concepts; it is an important carrier of the text's central idea and a concentrated embodiment of the text's content. The method identifies topic sentences through weighted calculation over 5 dimensions — feature words, position, emotion words, topic relevance, and potential topic sentences — with an initial maximum score of 1 per dimension.
3.1 Feature words
The marked topic sentences were extracted from corpus 1, and seven classes of topic sentence feature words were induced: single-character cue words, double-character cue words, suggestive vocabulary, classical-Chinese cue components, fixed structures, words expressing summarization, and words expressing logical order.
(1) Single-character cue words
Single-character cue words that often appear in topic sentences mainly include "say", "call", "think", "see", etc. They often co-occur with the evaluation subject — generally a person's name, an organization name, or the like representing a certain viewpoint — and are often followed by a comma. For example: The article says that a small dictionary fully reflects the changes of China's times, and the socioeconomic changes of China are reflected in the small dictionary.
(2) Double-character cue words
Double-character cue words commonly appearing in topic sentences mainly include "think", "express", "feel", "point out", "suggest", "call for", etc. They generally appear together with the evaluation subject and object, and are followed by commas, double quotation marks, colons, and the like. For example: Experts point out that existing Chinese textbooks have four deficiencies: a lack of classics, a lack of the child's perspective, a lack of joy, and a lack of reality.
(3) Suggestive vocabulary
Suggestive vocabulary often appearing in topic sentences mainly includes "in fact", "clearly", "for this", "unquestionably", "it is still that sentence", etc., often followed by a comma. For example: In fact, the creativity and vitality presented by network hot words have proven their own worth.
(4) Classical-Chinese cue components
The cue component of some topic sentences in the field of language-and-writing public opinion is a classical Chinese expression, which is very characteristic, for example particles meaning "it can be seen", "needless to say/no doubt", and the like.
(5) Fixed structures
Some fixed structures are also common in topic sentences, mainly "from/regarding/according to/in terms of … speaking/seeing", "for … ", "looking at … ", "if … ", "just like … ", etc. For example: For Taboada, it is very difficult for Argentine entrepreneurs to do business in China without knowing Confucius and China's cultural tradition.
(6) Words expressing summarization
The invention collects common words with a summarizing meaning, mainly "in summary", "overall", "all in all", and the like. For example: In a word, the quality of Chery cars is still reassuring.
(7) Words representing a logical order
To express their viewpoints in a more orderly way, some language-and-writing public opinion articles use ordinal words or other expressions marking layers when expressing logical order, and such sentences are usually topic sentences.
Finally, through analysis of these feature words, 156 topic sentence feature words are extracted and stored in a topic sentence marking word list. A sentence containing a feature word scores 0.7, with 0.1 added for each additional feature word, up to a maximum of 1.
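The scoring rule above (0.7 for a sentence containing one feature word, plus 0.1 per additional one, capped at 1) can be sketched as follows; the small cue-word set here is a hypothetical stand-in for the 156-entry topic sentence marking word list:

```python
def feature_word_score(sentence_words, cue_words):
    """0 if no feature word; otherwise 0.7 for the first feature word
    plus 0.1 for each additional one, capped at 1.0."""
    hits = sum(1 for w in sentence_words if w in cue_words)
    if hits == 0:
        return 0.0
    return min(0.7 + 0.1 * (hits - 1), 1.0)

# Illustrative stand-in cue words (the real list has 156 entries)
CUE_WORDS = {"in summary", "overall", "all in all", "in fact", "point out"}
```

For example, a sentence containing "in fact" and "overall" scores 0.8, and no number of feature words can push a sentence above 1.0.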
3.2 Topic sentence position features
Besides feature words, topic sentences have obvious positional features, which also provide important clues for their identification. All topic sentences were extracted from corpus 1 — 1701 in total — and the results of statistical analysis are as follows.
(1) The topic sentences were counted at the first sentence, last sentence, and other positions of each paragraph; the result is shown in Table 2.
TABLE 2 Distribution of topic sentences at the first, last, and other positions of each paragraph

Position | First sentence of paragraph | Last sentence of paragraph | Other positions
Count | 725 | 512 | 464
Table 2 shows 725 topic sentences at the first sentence of a paragraph, 512 at the last sentence, and 464 at other positions, the latter accounting for only 27% of the total. It can be seen that topic sentences are mainly distributed at the first and last sentences of paragraphs, with first sentences more frequent than last sentences.
(2) The distribution of topic sentences across the first, second, last, and other paragraphs is shown in Table 3.
TABLE 3 Distribution of topic sentences across paragraphs

Paragraph | First | Second | Last | Other
Count | 121 | 196 | 492 | 892
Share | 7% | 12% | 29% | 52%
Table 3 shows 121 topic sentences (7%) in the first paragraph, 196 (12%) in the second paragraph, 492 (29%) in the last paragraph, and 892 (52%) in other paragraphs. Although topic sentences are spread across all paragraphs of the text, the last paragraph holds the most at 29%, followed by the second paragraph at 12%, while the first paragraph also accounts for 7%.
According to these statistics, sentences are divided into three classes — first, last, and other — and paragraphs into four classes — first, second, last, and other. Ranked by sentence importance from high to low, the scores are: last-paragraph first sentence (1), last-paragraph last sentence (0.95), second-paragraph first sentence (0.9), second-paragraph last sentence (0.85), first-paragraph first sentence (0.8), first-paragraph last sentence (0.75), other-paragraph first sentence (0.7), other-paragraph last sentence (0.65), last-paragraph other sentence (0.6), second-paragraph other sentence (0.55), first-paragraph other sentence (0.5), other-paragraph other sentence (0.45).
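The twelve position scores can be written as a lookup table keyed by (paragraph class, sentence class); a minimal sketch, where the class labels are our own shorthand for the categories above:

```python
# (paragraph class, sentence class) -> topic-sentence position score,
# following the high-to-low ranking given in the text.
POSITION_SCORE = {
    ("last", "first"): 1.0,   ("last", "last"): 0.95,
    ("second", "first"): 0.9, ("second", "last"): 0.85,
    ("first", "first"): 0.8,  ("first", "last"): 0.75,
    ("other", "first"): 0.7,  ("other", "last"): 0.65,
    ("last", "other"): 0.6,   ("second", "other"): 0.55,
    ("first", "other"): 0.5,  ("other", "other"): 0.45,
}

def position_score(paragraph_class: str, sentence_class: str) -> float:
    """Score a sentence by its paragraph (first/second/last/other)
    and its position within that paragraph (first/last/other)."""
    return POSITION_SCORE[(paragraph_class, sentence_class)]
```

The table makes the ordering explicit: the first sentence of the last paragraph scores highest, and "other sentence of another paragraph" lowest.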
3.3 Topic relevance calculation
According to the foregoing statistics, 92% of titles reflect the article's theme; on this basis, the invention calculates the relevance of a sentence to the topic through the similarity between the sentence and the title. Two articles are described below as examples.
Example 1: Title: Bus routes in Suzhou try dialect stop announcements; netizens and passengers feel "warmth" [28]
Topic sentence: The dialect stop announcement stirred much discussion among netizens online; most Suzhou passengers said that hearing the local accent "feels especially cordial".
Example 2: Title: Seeing Suzhou from "Suzhou-dialect subway stop announcements" [44]
Topic sentence: On the whole, as someone who came to Suzhou from elsewhere, I fully understand the local feelings of Suzhou people; Suzhou-dialect stop announcements make riding the subway convenient for natives and for elderly people unfamiliar with Mandarin.
In example 1, the words similar between the topic sentence and the title are "netizen", "Suzhou", and "passenger", and the identical characters are "net" and "feeling"; in example 2, the similar words are "elsewhere", "Suzhou", "use", "Suzhou dialect", and "subway", and the identical characters are "Su", "announcement", and "station". "Stop announcement" is segmented into separate words; compared at the word level, the title and the topic sentence share only two identical words.
Since the title has its own distinctive features, it differs from body sentences, and the purpose of the similarity comparison is to examine whether they talk about the same topic, not whether they agree semantically in full. On this basis, the advantages of two models — simple common words and minimum edit distance — are combined, with some improvement. Simple common words work well when words are completely identical, but when two words are only partly identical, the minimum edit distance helps: for "Suzhou" and "Su" above, the simple common-word model gives a similarity of 0, while a minimum edit distance of 1 yields a higher similarity.
Therefore, the similarity calculation considers not only how many identical words there are but also how many identical characters. The invention uses: Sim(Title, Sen) for the similarity between the title and a candidate topic sentence; WN (word number) for the number of words in the title, (1/WN) for the score of each shared word, and SWN for the number of shared words; CN (character number) for the number of characters in the title, (1/CN) for the score of each shared character, and SCN for the number of shared characters. The title-sentence similarity formula is as follows:
the simplified formula is as follows:
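The similarity formula itself appears only as an image in the source and is not recoverable verbatim. A plausible reconstruction from the variable definitions — a word-level score SWN·(1/WN) combined with a character-level score SCN·(1/CN), here by simple averaging, which is our assumption — looks like:

```python
def title_sentence_similarity(title: str, sentence: str, tokenize) -> float:
    """Plausible reconstruction of Sim(Title, Sen).
    Word level:      SWN shared words, each worth 1/WN (WN = title word count).
    Character level: SCN shared characters, each worth 1/CN (CN = title char count).
    Averaging the two levels is our assumption; the source's exact combination
    is given only in an omitted formula image."""
    title_words = tokenize(title)
    shared_words = set(title_words) & set(tokenize(sentence))
    word_score = len(shared_words) / len(title_words) if title_words else 0.0
    shared_chars = set(title) & set(sentence)
    char_score = len(shared_chars) / len(title) if title else 0.0
    return (word_score + char_score) / 2
```

With a whitespace tokenizer (`str.split`), an identical title and sentence score 1.0, and sentences sharing only characters but no whole words still receive partial credit — the behavior the "Suzhou"/"Su" example calls for.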
3.4 Whether the sentence contains emotion words
The aim of the invention is to judge the emotional orientation of the text, so whether a sentence contains emotion words is an important feature. The basic emotion dictionary of the CUCsas (sentiment analysis) orientation analysis system contains more than 27,000 entries and can automatically segment and emotion-label input sentences. For example: We need to fully value and utilize these unique resource advantages to build a good folk-culture ecological environment. After word segmentation and emotion labeling:
we want/v sufficient/a attach/v and/c use/v these/r are inherently thick/iv/po/u resource dominance/n/po,/w build/v good/a/po/u folk culture/n ecological environment/ln. W
Here ne denotes negative and po denotes positive; with emotion labeling it is easy to judge whether a sentence contains emotion words. A sentence containing an emotion word scores 0.7, with 0.1 added for each additional emotion word, up to a maximum of 1.
3.5 Whether the sentence is a potential topic sentence
Statistics herein show that the average sentence length in long texts is around 39 characters. If a sentence is far too long or too short, or contains words like "statement" or "comment", it is generally not a topic sentence, and such sentences receive no score; all other sentences are regarded as potential topic sentences. Every potential topic sentence that passes verification scores 1.
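A sketch of the potential-topic-sentence filter described above; the concrete length bounds are illustrative assumptions, since the source states only that the average length is about 39 characters and that over-long or over-short sentences are excluded:

```python
def is_potential_topic_sentence(sentence: str,
                                min_len: int = 15,
                                max_len: int = 80,
                                reporting_words=("statement", "comment")) -> bool:
    """Reject sentences that are far too short or too long relative to the
    ~39-character average (bounds here are assumed, not from the source),
    or that contain reporting words; everything else counts as a potential
    topic sentence (scored 1 once verified)."""
    if not (min_len <= len(sentence) <= max_len):
        return False
    return not any(w in sentence for w in reporting_words)
```

The filter is a cheap pre-check: it only decides whether the PT factor can contribute its single point to the weighted score.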
3.6 Weighted calculation
The invention sets the maximum score of each sentence at 10, distributed over the 5 dimensional features, which are called weighting factors. For convenience of discussion, the five factors "position, topic relevance, emotion words, feature words, and potential topic sentence" are abbreviated "LO, TR, EW, FW, PT" respectively. A topic sentence extraction experiment was carried out on the finely labeled corpus: each factor was first given an equal weight of 2 points and all 5 factors were used to identify topic sentences by weighting; then one of the five factors was deleted in turn, leaving 4 for testing. The results are shown in Table 4.
TABLE 4 Topic sentence identification effect with five equally weighted factors
The experimental results rank the importance of the five factors, from high to low, as: position, topic relevance, emotional words, feature words, and whether there is a potential topic sentence. We readjusted the weights accordingly; the weights of the five factors are shown in the table below.
TABLE 5 Optimized factor weights
Factor | Feature words | Topic relevance | Emotional words | Position | Potential topic sentence |
---|---|---|---|---|---|
Weight | 2 | 2 | 2 | 3 | 1 |
After the weights were re-optimized, the accuracy of topic sentence identification reached 0.49, close to 50%. This yields the following topic sentence weighting calculation formula.
TopSenScore=(FW+Sim(Title,Sen)+EW)*2+LO*3+PT
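The weighted formula can be transcribed directly; a minimal sketch assuming each of the five factor scores has already been computed and normalized to a per-factor maximum of 1:

```python
def top_sen_score(fw, tr, ew, lo, pt):
    """Weighted topic-sentence score: feature words (FW), topic
    relevance (TR = Sim(Title, Sen)) and emotional words (EW) are
    weighted 2, position (LO) is weighted 3, and the potential-topic-
    sentence flag (PT) is weighted 1, for a maximum total of 10."""
    return (fw + tr + ew) * 2 + lo * 3 + pt
```

With all five factors at their maximum of 1, the score reaches the stated ceiling of 10.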
4 Evaluation of the topic sentence recognition method
To verify the effect of topic key-sentence identification, the invention participated in COAE2014 Task 1, a chapter-level task named "news-oriented sentiment key sentence extraction and judgment": in a given news set (each article already split into sentences), the sentiment key sentences of each article must be identified, and tendency polarity judgment performed on the sentiment sentences. The test corpus comes from web news texts in XML format, with titles but no paragraph marks; the body is split into sentences arranged in their original order.
4.1 Algorithm Fine tuning
(1) Sentences are no longer divided by first and last paragraph; instead, all sentences are divided into three parts, the first 3 sentences, the last 3 sentences, and the middle sentences, which are given different weights;
(2) Feature words commonly used when expressing opinions in news, such as "learned" and "introduced", are added, and common combined opinion-expression patterns in news, such as "person name + expression verb (thinks, feels, says, etc.)", are given higher weight.
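The adjusted three-way position weighting in item (1) can be sketched as follows; the concrete weight values are illustrative assumptions, since the text specifies only the split itself:

```python
def position_weights(sentences, head_w=0.9, tail_w=0.8, mid_w=0.5, k=3):
    """Assign position weights: the first k sentences get head_w, the
    last k sentences get tail_w, and everything in between gets mid_w."""
    n = len(sentences)
    weights = []
    for i in range(n):
        if i < k:
            weights.append(head_w)
        elif i >= n - k:
            weights.append(tail_w)
        else:
            weights.append(mid_w)
    return weights
```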
4.2 evaluation results and analysis
The institutions participating in this evaluation are mainly research institutes and laboratories in the Chinese natural language processing field, and basically represent the current overall level of Chinese natural language processing technology. Our system, casia-cuc, submitted three sets of results; the best of them is shown in the table below.
Table 6 below shows the evaluation performance of casia-cuc in COAE2014:
runtag | PosR | PosP | PosF1 | NegR | NegP | NegF1 | Accuracy | MicroR | MicroP | MicroF1 |
---|---|---|---|---|---|---|---|---|---|---|
Baseline | 0.125 | 0.028 | 0.046 | 0.047 | 0.031 | 0.037 | 0.029 | 0.241 | 0.065 | 0.102 |
casia-cuc | 0.255 | 0.055 | 0.091 | 0.240 | 0.071 | 0.109 | 0.063 | 0.333 | 0.090 | 0.142 |
average | 0.154 | 0.037 | 0.057 | 0.101 | 0.049 | 0.062 | 0.043 | 0.219 | 0.073 | 0.110 |
median | 0.130 | 0.034 | 0.049 | 0.068 | 0.052 | 0.057 | 0.041 | 0.220 | 0.068 | 0.104 |
best | 0.464 | 0.063 | 0.110 | 0.240 | 0.085 | 0.109 | 0.065 | 0.389 | 0.104 | 0.164 |
Table 6 notes: the Baseline scores sentences by accumulated keyword scores to extract sentiment key sentences, and judges sentence tendency with a naive Bayes classifier. The best accuracy among all participating systems is only 0.065 and the best micro-average F1 only 0.164, which shows that chapter-level tendency analysis is quite difficult. The reasons are as follows: on one hand, the sentences of long texts are mostly complex sentences, whose tendency polarity is very hard to judge; on the other hand, an article may contain several topic sentences while the gold standard allows only 2 to be selected, so the judgments of the computer and the human annotators can differ greatly, and an answer counts only if the sentiment key sentence is identified correctly and its tendency is also judged correctly. Our results ranked 2nd in positive-tendency F1, 1st in negative-tendency F1, 2nd in accuracy, and 3rd in micro-average; although these results are good, the absolute accuracy is not optimistic, and there is still large room for improvement.
5 Sentence emotional tendency analysis
Once the topic sentences are identified, the next, more important step is the emotion calculation of the topic sentences and the title; the accuracy of this emotion calculation directly affects the tendency analysis of the whole text. Sentence-level tendency analysis mainly uses the CUCsas system, which has performed well in several COAE sentiment evaluations, reaching an accuracy of up to 83% in tendency intensity analysis at the short-text level (five-level scoring), the best tendency-polarity performance at the COAE2012 short-text level. The calculation flow for a sentence is shown in Fig. 1.
Step 3 comprises the following steps:
6 Polarity calculation of the long text (i.e. the emotional tendency calculation of the long text in the invention)
Analysis and processing of the title and topic sentences is the basis of chapter-level tendency calculation; according to the long-text tendency analysis strategy, tendency calculation combines the title with the topic sentences. The invention extracts only two topic sentences from the body of the text; together with the title they form 3 sentences, and the emotional tendency polarity (positive, negative or neutral) of each is calculated. Their tendency polarities fall into three cases: all 3 sentences have exactly the same emotional tendency, and the tendency of any one sentence can represent the article tendency; the tendencies of the 3 sentences are completely different, and the tendency of the title is taken; two sentences have the same tendency, and that shared tendency is taken. Test results on Corpus 2 are shown in Table 7.
TABLE 7 text tendency test details
Tendency | Number of texts | Correct texts | Accuracy |
---|---|---|---|
Positive | 135 | 119 | 0.88 |
Negative | 183 | 146 | 0.80 |
Neutral | 7 | 2 | 0.29 |
Total | 325 | 267 | 0.82 |
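The three-case combination rule for the title and the two topic sentences can be sketched as below; the polarity labels `"pos"`, `"neg"`, `"neu"` are illustrative names:

```python
def combine_polarity(title, sen1, sen2):
    """Combine the polarities of the title and two topic sentences into
    an article-level polarity:
    - all three agree  -> that shared polarity
    - all three differ -> the title's polarity
    - exactly two agree -> the majority polarity
    """
    votes = [title, sen1, sen2]
    if len(set(votes)) == 3:  # all different: fall back to the title
        return title
    # all the same, or a two-way majority: the most common label wins
    return max(set(votes), key=votes.count)
```

Note the title serves as the tie-breaker only when all three polarities differ; in the other two cases the majority vote and the title-fallback coincide with the rule stated in the text.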
Table 7 above shows that the average accuracy reaches 0.82, and the method has already been applied in practice in the national language public-opinion database. To further test the accuracy of chapter tendency analysis, we ran tests on Corpus 3; the results are shown in Table 8 below.
TABLE 8 text tendency test details
Tendency | Number of texts | Correct texts | Accuracy |
---|---|---|---|
Positive | 141 | 119 | 0.84 |
Negative | 170 | 130 | 0.76 |
Neutral | 71 | 26 | 0.37 |
Total | 382 | 275 | 0.72 |
Table 8 above shows that the average accuracy on this corpus is 0.72, 10 percentage points lower than on the corpus from the language-and-writing domain, so the tendency analysis of long texts still has considerable room for improvement. The main reason may be that the evaluation corpus comes from different domains, and the characteristics of topic sentences may differ between domains.
Compared with the prior art, this method calculates text tendency by combining the title with the topic sentences; the difficulty lies mainly in the precision of topic sentence identification. According to the statistics of this emotion calculation method, the proportion of texts whose whole-text tendency can be judged from the title alone reaches 0.61, and combining the title with the sentiment sentences improves the accuracy of text emotion calculation to the greatest extent: the accuracy of the method reaches 0.82 on long texts in the language-and-writing domain. Given the difficulty of long-text analysis, the invention holds that in a practical system specific problems should be analyzed concretely according to the text characteristics of specific domains, and tackled one by one.
Finally, it should be noted that although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that modifications may still be made to the technical solutions described in the foregoing embodiments, or equivalent replacements may be made to some of their technical features, without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall be included in its protection scope.
Claims (1)
1. A long text emotion calculation method based on a two-topic method is characterized by comprising the following steps:
step 1: title analysis: counting, from corpus 1, two statistics: the number of texts whose whole-text tendency can be judged from the title, and the number of texts whose overall "theme" is reflected in the title;
step 2: the theme sentence identification specifically comprises:
step 201: inputting a long text, where a long text refers to a text whose length is generally in the range of 900 to 1100 characters;
step 202: after word segmentation labeling and topic sentence position feature labeling, topic relevance calculation is performed; the word segmentation labeling includes feature word labeling, and the feature words are: single-word cue words, double-word cue words, suggestive vocabulary, discourse cue components, fixed structures, words expressing summarization, and words expressing logical order; through analysis of topic sentence feature words, 156 topic sentence feature words are extracted and stored in a topic sentence marker word list; a sentence containing a feature word scores 0.7, each additional feature word adds 0.1 on that basis, and the highest score is 1; the position features of topic sentences are ranked by sentence importance, with scores ordered from high to low as follows: first sentence of the last paragraph 1, last sentence of the last paragraph 0.95, first sentence of the second paragraph 0.9, last sentence of the second paragraph 0.85, first sentence of the first paragraph 0.8, last sentence of the first paragraph 0.75, first sentence of other paragraphs 0.7, last sentence of other paragraphs 0.65, other sentences of the last paragraph 0.6, other sentences of the second paragraph 0.55, other sentences of the first paragraph 0.5, other sentences of other paragraphs 0.45; in the topic relevance calculation, Sim(Title, Sen) denotes the similarity between the title and a candidate topic sentence, WN (word number) denotes the number of words in the title, (1/WN) the score of each shared word, and SWN the number of shared words; CN (character number) denotes the number of characters in the title, (1/CN) the score of each shared character, and SCN the number of shared characters; the title-sentence similarity calculation formula is as follows:
Sim(Title, Sen) = (SWN * (1/WN) + SCN * (1/CN)) / 2
the simplified formula is as follows:
Sim(Title, Sen) = (SWN/WN + SCN/CN) / 2
step 203: emotion labeling and phrase emotion value calculation; the basic emotion dictionary of the CUCsas system is used to automatically perform word segmentation and emotion labeling on input sentences; a sentence containing an emotional word scores 0.7, and each additional emotional word adds 0.1 on that basis, up to a maximum of 1;
step 204: calculating and outputting the sentence emotion value; calculating the sentence emotion value includes judging whether the sentence is a potential topic sentence, and every potential topic sentence that passes verification scores 1; the five factors of position, topic relevance, emotional words, feature words and potential topic sentence are abbreviated as LO, TR, EW, FW and PT respectively; the topic sentence weighting calculation formula is as follows:
TopSenScore=(FW+Sim(Title,Sen)+EW)*2+LO*3+PT
and step 3: calculating the polarity of the long text: two topic sentences are extracted from the body of the text and, together with the title, give 3 sentences; the emotional tendency polarity of each of the three sentences is calculated as positive, negative or neutral, and their tendency polarities fall into three cases in total: all 3 sentences have exactly the same emotional tendency, and the tendency of any one sentence can represent the article tendency; the tendencies of the 3 sentences are completely different, and the tendency of the title is taken; two sentences have the same tendency, and that shared tendency is taken.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010613202.9A CN111783426A (en) | 2020-06-30 | 2020-06-30 | Long text emotion calculation method based on double-question method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111783426A true CN111783426A (en) | 2020-10-16 |
Family
ID=72761231
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010613202.9A Pending CN111783426A (en) | 2020-06-30 | 2020-06-30 | Long text emotion calculation method based on double-question method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111783426A (en) |
Non-Patent Citations (1)
Title |
---|
NANCHANG CHENG et al.: "Chinese Long Text Sentiment Analysis Based on the Combination of Title and Topic Sentences", 2019 6TH INTERNATIONAL CONFERENCE ON DEPENDABLE SYSTEMS AND THEIR APPLICATIONS (DSA), 26 March 2020 (2020-03-26), pages 348-352 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113076737A (en) * | 2021-03-26 | 2021-07-06 | 三亚中科遥感研究所 | Ecological environment perception network construction method fusing public emotion |
CN113076737B (en) * | 2021-03-26 | 2023-01-31 | 三亚中科遥感研究所 | Method for constructing ecological environment perception network fusing public emotions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||