CN108268668B - Topic diversity-based text data viewpoint abstract mining method - Google Patents


Info

Publication number
CN108268668B
Authority
CN
China
Prior art keywords
topic
sentence
emotion
word sequence
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810166896.9A
Other languages
Chinese (zh)
Other versions
CN108268668A (en
Inventor
廖祥文
陈国龙
赵楠
杨定达
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fuzhou University
Original Assignee
Fuzhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuzhou University filed Critical Fuzhou University
Priority to CN201810166896.9A priority Critical patent/CN108268668B/en
Publication of CN108268668A publication Critical patent/CN108268668A/en
Application granted granted Critical
Publication of CN108268668B publication Critical patent/CN108268668B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a topic diversity-based method for mining viewpoint abstracts from text data, comprising the following steps: step S1: preprocessing the topic texts; step S2: inputting a topic corpus and a background corpus; step S3: extracting the topic attributes of the topic corpus; step S4: adding emotion polarity to the obtained topic attributes for sentence vectorization; step S5: taking the obtained topic attributes as evaluation objects, analyzing the emotion polarity of the evaluation objects contained in each sentence with a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the emotion attribute features contained in the sentence, and vectorizing each sentence by these features; step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5. The method obtains the viewpoint abstract of a topic text efficiently and accurately and can be applied to larger-scale data sets.

Description

Topic diversity-based text data viewpoint abstract mining method
Technical Field
The invention relates to the fields of text summarization and sentiment analysis, and in particular to a method that generates, from massive topic text data in Chinese microblog corpora, a brief viewpoint abstract rich in user sentiment information. The viewpoint abstract accurately covers the key content discussed in the text, and the method can be applied to practical scenarios such as news summaries and product review summaries.
Background
Many technical approaches are currently available for research on opinion summarization. Conventional viewpoint abstract models include graph models and ranking models. Representative graph models include TextRank, PageRank and LexRank: sentences serve as nodes, some relation between sentences serves as the edge weight, sentence scores are updated iteratively with a random walk model, and a certain number of high-scoring sentences are combined into the viewpoint abstract. Ranking models either construct a sentence scoring function that accounts for factors such as the diversity and redundancy of the viewpoint abstract, or rank sentences by relative score using KL divergence or MMR; the viewpoint abstract is then obtained from the score ranking. Both kinds of method ignore the finer-grained topic attributes of the text: they model the diversity of the text's subject matter through the diversity of all words in the text and do not consider the influence of subject keywords on the viewpoint abstract, which limits follow-up research on these models to some extent.
More recently, researchers at home and abroad have studied viewpoint abstract models based on generative approaches and on submodular functions. The generative approach achieves good results, but the time complexity of solving it is too high: even on a small data set it takes several times as long as other methods, so it cannot be applied to practical big-data scenarios. The submodular-function-based method exploits submodularity so that the solution found by a greedy algorithm is guaranteed to be no worse than 63% of the optimal solution, and the greedy algorithm takes several factors into account when selecting sentences. Although its experimental results are relatively good, its reliance on a manually constructed corpus tree makes it unsuitable for wider application scenarios.
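The 63% figure quoted above matches the classical approximation guarantee for greedily maximizing a monotone submodular set function under a cardinality constraint. As a brief illustration, with symbols that are assumptions of this note rather than notation from the patent (f is the objective, S_greedy the greedy solution, S* an optimal solution of the same size):

```latex
f(S_{\mathrm{greedy}}) \;\ge\; \left(1 - \frac{1}{e}\right) f(S^{*}) \;\approx\; 0.632\, f(S^{*})
```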
In general, a viewpoint abstract has two fundamental properties: 1) the abstract should cover the subject matter of the topic text; 2) the subject matter it covers should carry rich sentiment. Most existing models fall short in two respects. First, they try to make the viewpoint abstract cover the subject matter through the diversity of all words in the sentences; word diversity, however, cannot guarantee that the abstract covers the subject matter of the source text, and words unrelated to the subject affect the generated abstract. Second, they describe the sentiment of the abstract by the sentiment of whole sentences, so sentiment toward content unrelated to the subject is also counted. As a result, the final abstract contains much content and sentiment information unrelated to the subject of the text.
Therefore, a more efficient and accurate viewpoint abstract method is urgently needed: one that extracts topic attribute words from the source text with an entity extraction method as the subject keywords of the text, uses sentiment analysis to determine, in each sentence, the sentiment toward the topic attributes taken as evaluation objects, and selects sentences for the viewpoint abstract with a topic attribute diversity method that also incorporates sentence importance, so that the whole viewpoint abstract covers as many sentiment-bearing text subjects as possible.
Disclosure of Invention
The invention aims to solve the problem of compressing massive viewpoint text data and provides a viewpoint summarization method based on topic diversity. By working from topic attributes and their sentiment information, it addresses the problems of current methods, obtains the viewpoint abstract of a topic text efficiently and accurately, and can be applied to larger-scale data sets.
To achieve this purpose, the invention adopts the following technical scheme: a topic diversity-based text data viewpoint abstract mining method, comprising the following steps: step S1: preprocessing the topic texts, filtering out irrelevant texts that have no substantial content or meaning as well as common stop words; step S2: inputting a topic corpus and a background corpus; step S3: extracting the topic attributes of the topic corpus; step S4: adding emotion polarities, namely positive and negative emotion, to the topic attributes obtained in step S3, so that the positive topic attributes and negative topic attributes serve as emotion attribute features for vectorizing sentences; step S5: taking the topic attributes obtained in step S3 as evaluation objects, analyzing the emotion polarity of the evaluation objects contained in each sentence with a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the emotion attribute features contained in the sentence, where a feature value is 1 if the sentence contains the corresponding emotion attribute feature and 0 otherwise, so that each sentence is vectorized by means of the topic attributes and the sentiment analysis method; step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5.
In one embodiment of the present invention, the filtering rules in step S1 are as follows: (1) removing web page links from the comment sentences; (2) removing comment sentences shorter than 3 characters; (3) removing common irrelevant words from the comment sentences; (4) converting all English letters uniformly to lowercase or uppercase.
In an embodiment of the present invention, step S2 comprises: for the preprocessed texts, setting the text of the current topic as the topic corpus and the texts of the other topics as the background corpus.
In an embodiment of the present invention, in step S3 the log-likelihood ratio of each word in the topic corpus is calculated with the log-likelihood ratio method, and the words are filtered by a threshold to extract the topic attributes of the topic corpus, where a word's part of speech must be noun, adjective, verb or numeral.
In an embodiment of the present invention, step S4 comprises: taking the obtained topic attributes as evaluation objects, analyzing the emotion polarity of each evaluation object in a sentence with the multi-evaluation-object-oriented dynamic word sequence sentiment analysis method, and attaching positive and negative emotion polarity to the topic attributes to obtain positive topic attributes and negative topic attributes respectively.
Further, the emotion-bearing topic attributes of step S4 are taken as emotional topic attributes and used as features for the sentence feature vectorization in step S6. The multi-evaluation-object-oriented dynamic word sequence sentiment analysis method is a bag-of-words model based on a sentiment dictionary and mainly comprises two steps: first, the word sequence of a sentence is cut with the dynamic word sequence method to obtain a word sequence for each evaluation object contained in the sentence; second, the sentiment words in the word sequence of each evaluation object are matched against the sentiment dictionary, the sentiment tendency of the evaluation object is calculated from the polarity and weight of the sentiment words, and the sentence feature vector is obtained from the topic attributes and emotion polarities in the sentence. The dynamic word sequence procedure is briefly as follows:
step S41: determining the positions of the evaluation objects in the sentence; for each sentence, the topic attributes in the topic attribute set are taken as evaluation objects, and the position of each evaluation object in the sentence is determined by scanning from the front of the sentence to the back, so that positions run from small to large;
step S42: taking the position of the evaluation object in the sentence, [formula image], as the center, expanding to the left and to the right until a punctuation mark or another evaluation object is met; if a punctuation mark is met on the left or right, the words between the punctuation mark and the evaluation object are taken as the left or right word sequence; if another evaluation object is met on the left or right, the midpoint of the position coordinates of the two evaluation objects is taken, and the words from the midpoint to the position of the evaluation object are taken as the left or right word sequence;
step S43: after the above steps, the left word sequence [formula image] and the right word sequence [formula image] of an evaluation object are obtained, and merging them gives the complete word sequence of the evaluation object, [formula image], in which [formula image] is the evaluation object itself and is removed during sentiment analysis. The parameter range is [formula image]; the specific values of the two parameters change dynamically and are not fixed, and they differ between two different word sequences. In addition, any two dynamic word sequences [formula image] and [formula image] satisfy the condition [formula image], that is, the two word sequences do not overlap.
In an embodiment of the present invention, the diversity objective function constructed in step S6 ensures that the selected set of a given number of sentences contains the most emotional topic attributes; this sentence set is used as the viewpoint abstract, which ensures that the final viewpoint abstract has the best diversity.
In an embodiment of the present invention, the objective function is constructed as follows: first, a viewpoint sentence scoring function that fuses topic diversity and sentence importance is constructed. The scoring function considers the difference in topic attributes between a sentence and the abstract set and incorporates the importance of the sentence, which is computed from the topic attribute weights and the topic attributes the sentence contains. At each selection, the sentence that most increases the topic diversity of the viewpoint abstract set is added to the viewpoint abstract, where the number of sentences is limited to at most 20 or is limited by a given compression ratio.
Compared with the prior art, the invention has the following beneficial effects:
1. The data are preprocessed, which widens the applicability of the method: the raw data are cleaned and irrelevant texts are filtered out, so the topic attributes extracted by the topic attribute extraction method are more accurate. The method can be applied not only to Chinese microblogs but also to website news and product reviews.
2. Only the positive and negative emotions of topic attributes are considered; neutral emotion is not discussed. By working from topic attributes and their sentiment information, the method addresses the problems of current methods, obtains the viewpoint abstract of a topic text efficiently and accurately, and can be applied to larger-scale data sets.
Drawings
FIG. 1 is a schematic view of the main process of the present invention.
Detailed Description
The invention is further explained below with reference to the figures and the specific embodiments.
The invention provides a topic diversity-based text data viewpoint abstract mining method, comprising the following steps: step S1: preprocessing the topic texts, filtering out irrelevant texts that have no substantial content or meaning as well as common stop words; step S2: inputting a topic corpus and a background corpus; step S3: extracting the topic attributes of the topic corpus; step S4: adding emotion polarities, namely positive and negative emotion, to the topic attributes obtained in step S3, so that the positive topic attributes and negative topic attributes serve as emotion attribute features for vectorizing sentences; step S5: taking the topic attributes obtained in step S3 as evaluation objects, analyzing the emotion polarity of the evaluation objects contained in each sentence with a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the emotion attribute features contained in the sentence, where a feature value is 1 if the sentence contains the corresponding emotion attribute feature and 0 otherwise, so that each sentence is vectorized by means of the topic attributes and the sentiment analysis method; step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5.
In one embodiment of the present invention, the filtering rule in step S1 is as follows:
(1) Web page links in the comment sentences, such as "http://t.cn/RcwWYQZ", are removed.
(2) Comment sentences shorter than 3 characters are removed; such sentences carry too little information, and most of them consist only of emoticons with no other useful content.
(3) Common irrelevant words in the comment sentences, such as "group pictures" and "original text forwarding", are removed.
(4) All English letters are converted uniformly to lowercase.
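The four filtering rules can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `preprocess`, the `NOISE_WORDS` tuple and the URL pattern are hypothetical and do not come from the patent, and the length check is applied after cleaning, which is one possible ordering of the rules.

```python
import re

# Hypothetical noise-word list; the two entries are the examples named in the text.
NOISE_WORDS = ("group pictures", "original text forwarding")
URL_PATTERN = re.compile(r"https?://\S+")
MIN_LENGTH = 3  # rule (2): sentences shorter than 3 characters are dropped


def preprocess(sentences, noise_words=NOISE_WORDS):
    """Apply filtering rules (1)-(4) to a list of comment sentences."""
    cleaned = []
    for sentence in sentences:
        text = URL_PATTERN.sub("", sentence)   # rule (1): strip web page links
        for word in noise_words:               # rule (3): strip irrelevant words
            text = text.replace(word, "")
        text = text.lower().strip()            # rule (4): lowercase English letters
        if len(text) >= MIN_LENGTH:            # rule (2): length filter on the cleaned text
            cleaned.append(text)
    return cleaned
```

For example, `preprocess(["Great phone http://t.cn/RcwWYQZ", "ok"])` keeps only the first sentence, with its link removed and its letters lowercased.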
In an embodiment of the present invention, step S2 comprises: for the preprocessed texts, setting the text of the current topic as the topic corpus and the texts of the other topics as the background corpus.
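A minimal sketch of this corpus split is given below. It assumes, purely for illustration, that the preprocessed texts are held in a dictionary mapping a topic name to its list of sentences; the function name `split_corpora` is hypothetical.

```python
def split_corpora(texts_by_topic, current_topic):
    """Return (topic_corpus, background_corpus) for the chosen topic.

    texts_by_topic: dict mapping a topic name to its list of sentences
    (an assumed data layout, not one specified by the patent).
    """
    topic_corpus = list(texts_by_topic[current_topic])
    # Every sentence that belongs to any other topic goes to the background corpus.
    background_corpus = [
        sentence
        for topic, sentences in texts_by_topic.items()
        if topic != current_topic
        for sentence in sentences
    ]
    return topic_corpus, background_corpus
```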
In an embodiment of the present invention, in step S3 the log-likelihood ratio of each word in the topic corpus is calculated with the log-likelihood ratio method, and the words are filtered by a threshold to extract the topic attributes of the topic corpus, where a word's part of speech must be noun, adjective, verb or numeral.
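The text does not give the log-likelihood ratio formula, so the sketch below assumes the commonly used G² statistic over a 2x2 contingency table (word versus other words, topic corpus versus background corpus). The part-of-speech tags `n`, `a`, `v`, `m`, the default threshold of 10.83 (the chi-square critical value at the 0.001 level with one degree of freedom) and the rule of keeping only words that are relatively more frequent in the topic corpus are all assumptions of this illustration.

```python
import math
from collections import Counter

# Hypothetical tag set for noun, adjective, verb and numeral.
ALLOWED_POS = {"n", "a", "v", "m"}


def _term(observed, expected):
    # Contribution of one contingency-table cell; an empty cell contributes 0.
    return observed * math.log(observed / expected) if observed > 0 else 0.0


def log_likelihood_ratio(k1, n1, k2, n2):
    """G^2 for a word seen k1 times in n1 topic tokens and k2 times in n2 background tokens."""
    p = (k1 + k2) / (n1 + n2)          # pooled relative frequency
    e1, e2 = n1 * p, n2 * p            # expected counts under the null hypothesis
    return 2.0 * (
        _term(k1, e1) + _term(k2, e2)
        + _term(n1 - k1, n1 - e1) + _term(n2 - k2, n2 - e2)
    )


def extract_topic_attributes(topic_tokens, background_tokens, threshold=10.83):
    """Return {word: score} for words passing the POS filter and the threshold.

    Both corpora are non-empty lists of (word, pos) pairs, assumed to be
    already segmented and tagged.
    """
    topic_counts = Counter(w for w, pos in topic_tokens if pos in ALLOWED_POS)
    background_counts = Counter(w for w, _ in background_tokens)
    n1, n2 = len(topic_tokens), len(background_tokens)
    attributes = {}
    for word, k1 in topic_counts.items():
        k2 = background_counts[word]
        if k1 / n1 <= k2 / n2:
            continue  # not over-represented in the topic corpus
        score = log_likelihood_ratio(k1, n1, k2, n2)
        if score >= threshold:
            attributes[word] = score
    return attributes
```

The G² statistic is two-sided, which is why the sketch adds the relative-frequency check before scoring; without it, words that are unusually rare in the topic corpus would also pass the threshold.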
In an embodiment of the present invention, step S4 comprises: taking the obtained topic attributes as evaluation objects, analyzing the emotion polarity of each evaluation object in a sentence with the multi-evaluation-object-oriented dynamic word sequence sentiment analysis method, and attaching positive and negative emotion polarity to the topic attributes to obtain positive topic attributes and negative topic attributes respectively. The set of all topic attributes with positive emotion is denoted by [formula image], and the set of negative topic attributes by [formula image]. The topic attributes with emotion are called emotional topic attributes and are used as features for the sentence feature vectorization in the next step.
Further, the emotion-bearing topic attributes of step S4 are taken as emotional topic attributes and used as features for the sentence feature vectorization in step S6. The multi-evaluation-object-oriented dynamic word sequence sentiment analysis method is a bag-of-words model based on a sentiment dictionary and mainly comprises two steps: first, the word sequence of a sentence is cut with the dynamic word sequence method to obtain a word sequence for each evaluation object contained in the sentence; second, the sentiment words in the word sequence of each evaluation object are matched against the sentiment dictionary, the sentiment tendency of the evaluation object is calculated from the polarity and weight of the sentiment words, and the sentence feature vector is obtained from the topic attributes and emotion polarities in the sentence. The dynamic word sequence procedure is briefly as follows:
step S41: determining the positions of the evaluation objects in the sentence; for each sentence, the topic attributes in the topic attribute set are taken as evaluation objects, and the position of each evaluation object in the sentence is determined by scanning from the front of the sentence to the back, so that positions run from small to large;
step S42: taking the position of the evaluation object in the sentence, [formula image], as the center, expanding to the left and to the right until a punctuation mark or another evaluation object is met; if a punctuation mark is met on the left or right, the words between the punctuation mark and the evaluation object are taken as the left or right word sequence; if another evaluation object is met on the left or right, the midpoint of the position coordinates of the two evaluation objects is taken, and the words from the midpoint to the position of the evaluation object are taken as the left or right word sequence;
step S43: after the above steps, the left word sequence [formula image] and the right word sequence [formula image] of an evaluation object are obtained, and merging them gives the complete word sequence of the evaluation object, [formula image], in which [formula image] is the evaluation object itself and is removed during sentiment analysis. The parameter range is [formula image]; the specific values of the two parameters change dynamically and are not fixed, and they differ between two different word sequences. In addition, any two dynamic word sequences [formula image] satisfy the condition [formula image], that is, the two word sequences do not overlap.
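The cutting and scoring procedure can be illustrated with the following sketch. Several details are assumptions rather than statements of the patent: sentences are taken to be pre-tokenized lists, the punctuation set is a small sample, the midpoint between two evaluation objects is computed with integer division (so the middle token goes to the left-hand object, which keeps the sequences disjoint), and the sentiment dictionary maps a word to a signed weight.

```python
PUNCTUATION = {",", ".", "!", "?", ";", "，", "。", "！", "？", "；"}


def dynamic_word_sequences(tokens, objects):
    """Return [(evaluation_object, context_words)] for every object occurrence."""
    result = []
    for pos, token in enumerate(tokens):
        if token not in objects:
            continue
        # Expand left until punctuation or another evaluation object is met.
        left_start = 0
        for j in range(pos - 1, -1, -1):
            if tokens[j] in PUNCTUATION:
                left_start = j + 1
                break
            if tokens[j] in objects:
                left_start = (j + pos) // 2 + 1   # start just after the midpoint
                break
        # Expand right in the same way (right_end is exclusive).
        right_end = len(tokens)
        for j in range(pos + 1, len(tokens)):
            if tokens[j] in PUNCTUATION:
                right_end = j
                break
            if tokens[j] in objects:
                right_end = (pos + j) // 2 + 1    # stop at the midpoint, inclusive
                break
        # The evaluation object itself is excluded from its own context.
        context = tokens[left_start:pos] + tokens[pos + 1:right_end]
        result.append((token, context))
    return result


def object_polarity(context, lexicon):
    """Sum signed lexicon weights; neutral results are reported as None."""
    score = sum(lexicon.get(word, 0.0) for word in context)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return None
```

Because both neighbours of a gap use the same midpoint, the right sequence of one object ends exactly where the left sequence of the next begins, which is one way to satisfy the non-overlap condition stated above.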
In an embodiment of the present invention, the diversity objective function constructed in step S6 ensures that the selected set of a given number of sentences contains the most emotional topic attributes; this sentence set is used as the viewpoint abstract, which ensures that the final viewpoint abstract has the best diversity.
In an embodiment of the present invention, the objective function is constructed as follows: first, a viewpoint sentence scoring function that fuses topic diversity and sentence importance is constructed. The scoring function considers the difference in topic attributes between a sentence and the abstract set and incorporates the importance of the sentence, which is computed from the topic attribute weights and the topic attributes the sentence contains. At each selection, the sentence that most increases the topic diversity of the viewpoint abstract set is added to the viewpoint abstract, where the number of sentences is limited to at most 20 or is limited by a given compression ratio.
The method rests on two basic assumptions: 1) the topic attributes of a text are its central idea and main subject; 2) the same topic attribute with different emotions counts as different attributes, each of which is part of the subject matter and main discussion content of the text. The method comprises: a topic attribute extraction method based on the log-likelihood ratio; a multi-evaluation-object-oriented dynamic word sequence sentiment analysis method, in which sentences are cut according to the topic attributes by the dynamic word sequence method to obtain a subsequence for each topic attribute; a sentence vectorization method based on emotional topic attributes, in which sentences are vectorized with the emotion-bearing topic attributes as features; and a topic diversity-based viewpoint sentence selection method, which ensures, from the two aspects of topic diversity and sentence importance, that the topic diversity of the resulting viewpoint abstract is maximal. Fig. 1 is a schematic main flow chart of an embodiment of the present invention.
Step S1: preprocessing the microblog corpus to remove irrelevant words so that they do not affect the extraction of topic attributes.
Step S2: inputting the topic corpus and the background corpus, where the background corpus consists of the corpora of the other topics.
Step S3: extracting the topic attributes of the topic corpus with the log-likelihood ratio method.
Step S4: adding emotion polarities, namely positive and negative emotion, to the topic attributes obtained in step S3, so that the positive topic attributes and negative topic attributes serve as emotion attribute features for vectorizing sentences.
Step S5: taking the topic attributes obtained in step S3 as evaluation objects, analyzing the emotion polarity of the evaluation objects contained in each sentence with the multi-evaluation-object-oriented dynamic word sequence sentiment analysis method to obtain the emotion attribute features contained in the sentence, where a feature value is 1 if the sentence contains the corresponding emotion attribute feature and 0 otherwise, so that each sentence is vectorized by means of the topic attributes and the sentiment analysis method. The multi-evaluation-object-oriented dynamic word sequence method is a bag-of-words model based on a sentiment dictionary and mainly comprises two steps: first, the word sequence of a sentence is cut with the dynamic word sequence method to obtain a word sequence for each evaluation object contained in the sentence; second, the sentiment words in the word sequence of each evaluation object are matched against the sentiment dictionary, the sentiment tendency of the evaluation object is calculated from the polarity and weight of the sentiment words, and the sentence feature vector is obtained from the topic attributes and emotion polarities in the sentence.
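The 0/1 feature vectorization of step S5 might look like the sketch below. It assumes, for illustration only, that each sentence has already been reduced to a set of (topic attribute, polarity) pairs by the sentiment analysis step; the function names and the alphabetical feature ordering are hypothetical choices.

```python
def build_feature_index(topic_attributes):
    """Map every (attribute, polarity) feature to a fixed vector position."""
    features = []
    for attribute in sorted(topic_attributes):
        features.append((attribute, "positive"))
        features.append((attribute, "negative"))
    return {feature: i for i, feature in enumerate(features)}


def vectorize(sentence_features, feature_index):
    """Return a 0/1 vector: 1 where the sentence contains the feature, else 0."""
    vector = [0] * len(feature_index)
    for feature in sentence_features:
        if feature in feature_index:
            vector[feature_index[feature]] = 1
    return vector
```

With two topic attributes the vector has four positions, one for each attribute and polarity combination, so a sentence that praises one attribute and criticizes the other sets exactly two of them to 1.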
Step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5. The objective function ensures that the selected set of a given number of sentences contains the most emotional topic attributes; this sentence set is used as the viewpoint abstract, which ensures that the final viewpoint abstract has the best diversity. To optimize the objective function, sentences are selected for the viewpoint abstract with a topic diversity-based viewpoint sentence selection method: first, a viewpoint sentence scoring function that fuses topic diversity and sentence importance is constructed; then, at each selection, the sentence that most increases the topic diversity of the viewpoint abstract set is added to the viewpoint abstract, where the number of sentences is limited to at most 20 or is limited by a given compression ratio.
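The patent does not state the scoring function explicitly, so the following greedy sketch is only one plausible reading. It assumes positive feature weights, scores a candidate as the weight of the emotional topic attributes it newly covers plus `alpha` times its total attribute weight (taken here as the sentence importance), and stops at `max_sentences` or when no candidate adds a new attribute. The names `select_summary` and `alpha` are hypothetical.

```python
def select_summary(vectors, weights, max_sentences=20, alpha=0.1):
    """Greedily pick sentence indices that add the most uncovered feature weight.

    vectors: list of 0/1 feature vectors, one per sentence
    weights: positive weight of each emotional topic attribute (assumed given)
    """
    covered = set()
    selected = []
    remaining = list(range(len(vectors)))
    while remaining and len(selected) < max_sentences:
        best, best_score = None, 0.0
        for i in remaining:
            features = {j for j, v in enumerate(vectors[i]) if v}
            gain = sum(weights[j] for j in features - covered)
            if gain == 0:
                continue  # sentence adds no new emotional topic attribute
            importance = sum(weights[j] for j in features)
            score = gain + alpha * importance
            if score > best_score:  # ties keep the earlier sentence
                best, best_score = i, score
        if best is None:
            break  # nothing left that increases diversity
        selected.append(best)
        covered |= {j for j, v in enumerate(vectors[best]) if v}
        remaining.remove(best)
    return selected
```

A compression-ratio limit could be expressed by passing `max_sentences` as a fraction of the number of input sentences instead of the fixed value 20.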
The above are preferred embodiments of the present invention; all changes made according to the technical scheme of the present invention whose functional effects do not go beyond the scope of that technical scheme fall within the protection scope of the present invention.

Claims (5)

1. A topic diversity-based text data viewpoint abstract mining method, characterized by comprising the following steps:
step S1: preprocessing topic texts, and filtering irrelevant texts without substantial content and any significance and common stop words;
step S2: inputting a topic corpus and a background corpus;
step S3: extracting topic attributes of the topic corpus;
step S4: adding emotion polarities to the topic attributes obtained in the step S3, wherein the emotion polarities comprise positive emotions and negative emotions, and the positive topic attributes and the negative topic attributes are used as emotion attribute characteristics and are used for vectorizing the sentence;
the step S4 includes the following steps:
taking the obtained topic attribute as an evaluation object, analyzing the emotion polarity of the evaluation object in a sentence by utilizing a dynamic word sequence emotion analysis method facing to multiple evaluation objects, and adding positive or negative emotion polarities to the topic attribute respectively to obtain a positive topic attribute and a negative topic attribute respectively;
step S5: taking the topic attribute obtained in the step S3 as an evaluation object, analyzing the emotion polarity of the evaluation object contained in the sentence by adopting a dynamic word sequence emotion analysis method facing multiple evaluation objects to obtain the emotion attribute characteristics contained in the sentence, wherein if the sentence contains the emotion attribute characteristics, the corresponding characteristic value is 1, and if the sentence does not contain the emotion attribute characteristics, the corresponding characteristic value is 0, so that one sentence is subjected to characteristic vectorization by the topic attribute and emotion analysis method;
the multi-evaluation object oriented dynamic word sequence emotion analysis method is a word bag model based on an emotion dictionary and mainly comprises the following two steps: firstly, cutting a sentence word sequence by using a dynamic word sequence method to obtain a word sequence of each evaluation object contained in a sentence; secondly, matching the word sequence emotional words of each evaluation object by using an emotional dictionary, calculating the emotional tendency of the evaluation object by using the polarity and the weight of the emotional words, and obtaining a sentence feature vector according to the topic attribute and the emotional polarity in the sentence;
step S6: constructing a diversity objective function from the sentence feature vectors obtained in step S5;
the diversity objective function constructed in step S6 ensures that the selected sentence set, of a given size, contains as many emotional topic attributes as possible, and this sentence set is used as the viewpoint summary, so that the diversity of the resulting viewpoint summary is maximized;
the objective function is constructed as follows: a viewpoint sentence scoring function fusing topic diversity and sentence importance is constructed; the scoring function takes into account the difference in topic attributes between a sentence and the summary set and incorporates the importance of the sentence, which is computed from the topic attribute weights and the topic attributes contained in the sentence; the sentences that maximize the topic diversity of the viewpoint summary set are selected and added to the viewpoint summary, wherein the number of sentences is limited to no more than 20 or is limited by a given compression ratio.
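As an illustration only, the 0/1 feature vectorization of step S5 and the diversity-driven sentence selection of step S6 might be sketched in Python as follows. The claim does not state the scoring formula, so the linear combination of diversity and importance, the mixing parameter `lam`, the greedy selection loop, and the names `vectorize` and `select_summary` are all assumptions made for this sketch, not the claimed method itself.

```python
def vectorize(sentence_features, feature_index):
    """Return a 0/1 vector: 1 where the sentence contains the emotional topic attribute."""
    vec = [0] * len(feature_index)
    for feature in sentence_features:
        if feature in feature_index:
            vec[feature_index[feature]] = 1
    return vec


def _score(vec, covered, weights, lam):
    # Diversity: weight of attributes the summary does not cover yet (assumed form).
    diversity = sum(w for x, c, w in zip(vec, covered, weights) if x and not c)
    # Importance: weight of all attributes contained in the sentence (assumed form).
    importance = sum(w for x, w in zip(vec, weights) if x)
    return lam * diversity + (1 - lam) * importance


def select_summary(vectors, weights, max_sentences=20, lam=0.7):
    """Greedily pick sentence indices until the size limit is reached."""
    covered = [0] * len(weights)
    chosen = []
    remaining = list(range(len(vectors)))
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda i: _score(vectors[i], covered, weights, lam))
        if _score(vectors[best], covered, weights, lam) <= 0:
            break  # no remaining sentence carries any emotional topic attribute
        chosen.append(best)
        remaining.remove(best)
        covered = [c or x for c, x in zip(covered, vectors[best])]
    return chosen


# Hypothetical feature space: each topic attribute paired with "+" or "-".
features = [("screen", "+"), ("screen", "-"), ("battery", "+"), ("battery", "-")]
feature_index = {f: i for i, f in enumerate(features)}
weights = [2.0, 2.0, 1.0, 1.0]  # hypothetical topic attribute weights
sentences = [
    {("screen", "+")},
    {("screen", "+"), ("battery", "-")},
    {("battery", "-")},
    {("battery", "+")},
]
vectors = [vectorize(s, feature_index) for s in sentences]
summary = select_summary(vectors, weights, max_sentences=2)
```

In this sketch the limit of 20 sentences from the claim appears as the default of `max_sentences`; a compression ratio could be supported by passing a value derived from the number of input sentences instead.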
2. The topic diversity-based text data viewpoint abstract mining method of claim 1, wherein the filtering rules in step S1 are as follows:
(1) removing the webpage links in the comment sentences;
(2) removing the comment sentences with the character length smaller than 3;
(3) removing common irrelevant words in the comment sentences;
(4) converting all English letters uniformly to lowercase or to uppercase.
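The four filtering rules can be sketched in Python as below. This is a minimal sketch under stated assumptions: the URL pattern, the contents of `IRRELEVANT_WORDS`, whitespace tokenization, the choice of lowercase, and applying the length check after the other rules are all choices made for illustration, since the claim does not fix them. Chinese text would need a word segmenter in place of `split()`.

```python
import re

# Assumed URL pattern; the claim only says "webpage links".
URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Hypothetical list; the claim does not enumerate the irrelevant words.
IRRELEVANT_WORDS = {"um", "uh", "lol"}


def preprocess(comments, min_chars=3):
    """Apply the four filtering rules to a list of comment sentences."""
    cleaned = []
    for text in comments:
        text = URL_RE.sub(" ", text)   # rule (1): remove webpage links
        text = text.lower()            # rule (4): unify letter case
        tokens = [t for t in text.split() if t not in IRRELEVANT_WORDS]  # rule (3)
        text = " ".join(tokens)
        if len(text) >= min_chars:     # rule (2): drop sentences shorter than 3 characters
            cleaned.append(text)
    return cleaned


result = preprocess(["Great phone http://t.cn/abc LOL", "ok", "Battery lasts long"])
```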
3. The topic diversity-based text data viewpoint abstract mining method of claim 1, wherein step S2 comprises: for the preprocessed text, setting the text of the current topic as the topic corpus and the texts of the other topics as the background corpus.
4. The topic diversity-based text data viewpoint abstract mining method of claim 1, wherein in step S3 the log-likelihood ratio of each word in the topic corpus is calculated by the log-likelihood ratio method, and the words are filtered by a threshold to extract the topic attributes of the topic corpus, wherein the part of speech of each word must be noun, adjective, verb or numeral.
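A hedged Python sketch of the topic corpus versus background corpus comparison is given below. The claim does not spell out the statistic, so the sketch assumes the commonly used two-corpus log-likelihood ratio (G-squared) in its observed-versus-expected form; the part-of-speech tag set `ALLOWED_POS`, the default threshold, the extra check that a word is over-represented in the topic corpus, and the function names are likewise assumptions.

```python
import math
from collections import Counter

# Hypothetical tag set: noun, adjective, verb, numeral.
ALLOWED_POS = {"n", "a", "v", "m"}


def llr(a, b, c, d):
    """G^2 for a word seen a times in c topic tokens and b times in d background tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the topic corpus
    e2 = d * (a + b) / (c + d)  # expected count in the background corpus
    g2 = 0.0
    if a > 0:
        g2 += a * math.log(a / e1)
    if b > 0:
        g2 += b * math.log(b / e2)
    return 2.0 * g2


def extract_topic_attributes(topic_tokens, background_tokens, threshold=3.84):
    """Tokens are (word, pos) pairs; returns {word: score} for words above the threshold.

    The default 3.84 is only a placeholder (the 5% critical value of chi-square
    with one degree of freedom); the claim does not state the threshold.
    """
    if not topic_tokens or not background_tokens:
        return {}
    topic_counts = Counter(w for w, pos in topic_tokens if pos in ALLOWED_POS)
    background_counts = Counter(w for w, _ in background_tokens)
    c, d = len(topic_tokens), len(background_tokens)
    attributes = {}
    for word, a in topic_counts.items():
        b = background_counts[word]
        if a / c > b / d:  # keep only words over-represented in the topic corpus
            score = llr(a, b, c, d)
            if score >= threshold:
                attributes[word] = score
    return attributes


topic = [("battery", "n")] * 10 + [("the", "u")] * 10
background = [("the", "u")] * 29 + [("battery", "n")]
attrs = extract_topic_attributes(topic, background)
```

Here "battery" is kept because it is far more frequent in the topic corpus than in the background corpus, while "the" is rejected by the part-of-speech filter.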
5. The topic diversity-based text data viewpoint abstract mining method of claim 1, wherein:
the topic attributes with emotion polarity obtained in step S4 are taken as emotional topic attributes and used as features for sentence feature vectorization in step S6;
the dynamic word sequence method proceeds as follows:
step S41: determining the positions of the evaluation objects in the sentence; for each sentence, the topic attributes in the topic attribute set are taken as evaluation objects, and their positions in the sentence are determined in front-to-back order, so that the positions are sorted from smallest to largest;
step S42: taking the position of the evaluation object in the sentence [Figure DEST_PATH_IMAGE001] as the center, expanding leftward and rightward until a punctuation mark or another evaluation object is encountered;
if a punctuation mark is encountered on the left or right, the words between the punctuation mark and the evaluation object are taken as the left word sequence or the right word sequence;
if another evaluation object is encountered on the left or right, the midpoint of the position coordinates of the two evaluation objects is taken, and the words from the midpoint to the position of the evaluation object are taken as the left word sequence or the right word sequence;
step S43: after the above steps, the left word sequence [Figure DEST_PATH_IMAGE002] and the right word sequence [Figure DEST_PATH_IMAGE003] of an evaluation object are obtained, and the two are combined into the complete word sequence of the evaluation object [Figure DEST_PATH_IMAGE004], wherein [Figure DEST_PATH_IMAGE005] is the evaluation object itself, which is removed during emotion analysis; the range parameters [Figure DEST_PATH_IMAGE006] change dynamically and have no fixed values, and they differ between any two different word sequences; furthermore, any two dynamic word sequences [Figure DEST_PATH_IMAGE007] and [Figure DEST_PATH_IMAGE008] satisfy the condition [Figure DEST_PATH_IMAGE009], i.e., the two word sequences do not overlap.
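Steps S41 to S43 and the dictionary-based polarity decision can be sketched in Python as follows. Because the formula images are not available here, the details are assumptions: sentences are pre-tokenized lists, the midpoint between two evaluation objects is rounded down and the midpoint token is assigned to the earlier object (which keeps neighbouring sequences from overlapping), and the emotional tendency is taken to be the sum of signed lexicon weights. The names `dynamic_word_sequences`, `polarity`, `PUNCT`, and the sample lexicon are hypothetical.

```python
PUNCT = frozenset(",.!?;:，。！？；：")  # assumed punctuation set


def dynamic_word_sequences(tokens, objects, punct=PUNCT):
    """Return (position, object, words) for every evaluation object in the sentence."""
    result = []
    for p, token in enumerate(tokens):  # S41: positions in front-to-back order
        if token not in objects:
            continue
        lo = 0  # S42: expand to the left
        for i in range(p - 1, -1, -1):
            if tokens[i] in punct:
                lo = i + 1
                break
            if tokens[i] in objects:
                lo = (i + p) // 2 + 1  # start just after the midpoint
                break
        hi = len(tokens)  # S42: expand to the right
        for i in range(p + 1, len(tokens)):
            if tokens[i] in punct:
                hi = i
                break
            if tokens[i] in objects:
                hi = (p + i) // 2 + 1  # end at the midpoint (inclusive)
                break
        # S43: join left and right sequences, dropping the object itself
        result.append((p, token, tokens[lo:p] + tokens[p + 1:hi]))
    return result


def polarity(words, lexicon):
    """Sum signed weights of matched emotion words (assumed aggregation rule)."""
    score = sum(lexicon.get(w, 0.0) for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


tokens = ["the", "screen", "is", "great", "but", "battery", "drains", "fast",
          ".", "nice", "price"]
objects = {"screen", "battery", "price"}
lexicon = {"great": 1.0, "drains": -1.0, "nice": 0.5}  # hypothetical emotion dictionary
sequences = dynamic_word_sequences(tokens, objects)
labels = {obj: polarity(words, lexicon) for _, obj, words in sequences}
```

In this example "screen" receives the words up to the midpoint between "screen" and "battery", "battery" receives the words from just after that midpoint up to the full stop, and "price" receives only the words after the full stop, so no word belongs to two sequences.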
CN201810166896.9A 2018-02-28 2018-02-28 Topic diversity-based text data viewpoint abstract mining method Active CN108268668B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810166896.9A CN108268668B (en) 2018-02-28 2018-02-28 Topic diversity-based text data viewpoint abstract mining method

Publications (2)

Publication Number Publication Date
CN108268668A CN108268668A (en) 2018-07-10
CN108268668B true CN108268668B (en) 2022-01-18

Family

ID=62774300

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810166896.9A Active CN108268668B (en) 2018-02-28 2018-02-28 Topic diversity-based text data viewpoint abstract mining method

Country Status (1)

Country Link
CN (1) CN108268668B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766434B (en) * 2018-12-29 2020-12-11 北京百度网讯科技有限公司 Abstract generation method and device
CN110457672B (en) * 2019-06-25 2023-01-17 平安科技(深圳)有限公司 Keyword determination method and device, electronic equipment and storage medium
CN110941963A (en) * 2019-11-29 2020-03-31 福州大学 Text attribute viewpoint abstract generation method and system based on sentence emotion attributes
CN110889292B (en) * 2019-11-29 2022-06-03 福州大学 Text data viewpoint abstract generating method and system based on sentence meaning structure model
CN111209363B (en) * 2019-12-25 2024-02-09 华为技术有限公司 Corpus data processing method, corpus data processing device, server and storage medium
CN111309864B (en) * 2020-02-11 2022-08-26 安徽理工大学 User group emotional tendency migration dynamic analysis method for microblog hot topics
CN113051928B (en) * 2021-03-17 2023-08-01 卓尔智联(武汉)研究院有限公司 Block chain-based comment detection method and device and electronic equipment
CN113268660B (en) * 2021-04-28 2023-04-07 重庆邮电大学 Diversity recommendation method and device based on generation countermeasure network and server
CN113535942B (en) * 2021-07-21 2022-08-19 北京海泰方圆科技股份有限公司 Text abstract generating method, device, equipment and medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103617158A (en) * 2013-12-17 2014-03-05 苏州大学张家港工业技术研究院 Method for generating emotion abstract of dialogue text
CN105912644A (en) * 2016-04-08 2016-08-31 国家计算机网络与信息安全管理中心 Network review generation type abstract method
CN106227768A (en) * 2016-07-15 2016-12-14 国家计算机网络与信息安全管理中心 A kind of short text opining mining method based on complementary language material

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140188665A1 (en) * 2013-01-02 2014-07-03 CrowdChunk LLC CrowdChunk System, Method, and Computer Program Product for Searching Summaries of Online Reviews of Products

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
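Identify Sentiment-Objects from Chinese Sentences Based on Skip Chain; Minjie Zheng et al.; IEEE; 2012-09-10; full text *
PageRank-based Chinese multi-document text sentiment summarization; 林莉媛; Journal of Chinese Information Processing (中文信息学报); 2014-03-31; full text *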

Also Published As

Publication number Publication date
CN108268668A (en) 2018-07-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant