CN110889292B - Text data viewpoint abstract generating method and system based on sentence meaning structure model

Info

Publication number: CN110889292B
Application number: CN201911205403.9A
Authority: CN
Inventors: 廖祥文; 李晓滨; 陈志豪; 陈癸旭; 吴运兵
Original assignee: Fuzhou University
Current assignee: Fuzhou University
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2022-06-03
Anticipated expiration: 2039-11-29
Also published as: CN110889292A

Abstract

The invention relates to a method and a system for generating a viewpoint abstract from text data based on a sentence meaning structure model. The method first extracts a data set to be processed from a website and preprocesses it; then constructs a topic corpus set and a background corpus set and extracts topic attributes; then performs semantic weight calculation to obtain a semantic weight value for each sentence; then performs association weight calculation to obtain an association weight value for each sentence; and finally extracts the viewpoint abstract of the topic using the topic attributes, the semantic weight values, and the association weight values. This topic-attribute-based method for summarizing topic text solves problems found in current research methods based on topic attributes and their emotion information, obtains the abstract of a topic text efficiently and accurately, and can be applied to larger-scale data sets.

Description

Text data viewpoint abstract generating method and system based on sentence meaning structure model

Technical Field

The invention relates to the technical field of internet big data analysis, and in particular to a method and a system for generating a viewpoint abstract from text data based on a sentence meaning structure model.

Background

With the development of the internet, people obtain more and more information online, and data such as microblogs, website news, and commodity comments make up a growing share of people's online life. To make reading and screening more efficient, an abstract of a web text is often extracted for users to preview. This work was originally done manually; as the volume of data grew, automatic machine extraction came to be used to generate abstracts.

Conventional methods for automatically generating viewpoint abstracts include graph models and ranking models. Representative graph-model methods include TextRank, PageRank, and LexRank: sentences are taken as nodes, some relation between sentences is taken as the edge weight, sentence scores are iteratively updated through a random walk model, and a certain number of high-scoring sentences are combined into a viewpoint abstract. Ranking models either construct a sentence scoring function that takes into account factors such as the diversity and redundancy of the viewpoint abstract, or rank sentences by relative score using KL divergence or the MMR method, and obtain the viewpoint abstract from the score ranking. Both kinds of method ignore the finer-grained topic attributes of the text: they account for the diversity of the text's subject matter only through the diversity of all words in the text and do not consider the influence of the subject's keywords on the viewpoint abstract, which to some extent limits follow-up research on these models.

At present, researchers at home and abroad have done relatively little research on viewpoint abstract models; both a generative viewpoint abstract model and a viewpoint abstract model based on submodular functions have been proposed. The generative model performs well, but the time complexity of solving it is too high: even on a small data set it takes several times as long as other methods, so it cannot be applied to practical scenarios involving big data. The method based on submodular functions uses the submodular property to guarantee that the solution found by a greedy algorithm is no worse than 63% of the optimal solution, and the greedy algorithm takes several factors into account when selecting sentences. Although its experimental results are relatively good, its reliance on a manually constructed corpus tree makes it unsuitable for wider application scenarios.

Most existing models use the diversity of all words in the text's sentences to ensure that the viewpoint abstract covers the subject of the text. However, word diversity alone cannot guarantee that the viewpoint abstract covers the subject of the source text, and words irrelevant to the subject can affect the viewpoint abstract that is finally generated.

Disclosure of Invention

In view of the above, the present invention provides a method and a system for generating a viewpoint abstract from text data based on a sentence meaning structure model. Syntactically related words are extracted from the source text by an entity extraction method and used as keywords of the text subject; the emotion information about the effective words that serve as evaluation objects in each sentence is analyzed using an emotion analysis method; and sentences are selected and combined into a viewpoint abstract by a selection method based on sentence importance, so that the emotion of the whole viewpoint abstract is as clear as possible and the extracted abstract fits the text subject as closely as possible.

The invention is realized by the following scheme: a method for generating a viewpoint abstract from text data based on a sentence meaning structure model, which specifically comprises the following steps:

extracting a data set to be processed on a website, and preprocessing the data set;

constructing a topic corpus set and a background corpus set, and extracting topic attributes;

semantic weight calculation is carried out to obtain semantic weight values of sentences;

performing association weight calculation to obtain an association weight value of the sentence;

and extracting the viewpoint abstract in the topic by using the topic attribute, the semantic weight value and the associated weight value.

Further, the data set to be processed on the website includes, but is not limited to, microblog data, website news data, and commodity comment data.

Further, the preprocessing specifically comprises:

removing the webpage links in the comment sentences;

removing comment sentences whose character length is less than 3;

removing common irrelevant words in the comment sentences;

converting all English text to lowercase.

Further, constructing the topic corpus set and the background corpus set and extracting the topic attributes specifically comprises: for the preprocessed text, setting the current topic text as the topic corpus and the texts of other topics as the background corpus; calculating the log-likelihood ratio of the words in the topic corpus by the log-likelihood ratio method; filtering the words with a preset threshold, the part of speech of a retained word being required to be noun, adjective, verb, or numeral; and thereby extracting the topic attributes of the topic corpus.

Further, the semantic weight calculation comprises the following steps:

step S11: calculating the emotion score of each sentence as the emotion characteristics by using an emotion analysis method based on an emotion dictionary;

step S12: extracting lexical features by using a semantic word extraction method based on a semantic dictionary;

step S13: analyzing sentences by using BFS-CSA to obtain sentence meaning structural characteristics;

step S14: semantic weights are calculated for the sentences.

Further, step S14 is specifically: the semantic weight of the sentence is calculated using the sentence meaning structure feature F₆ and the lexical features. The lexical features are of 5 types: the average TF-IDF-POS word weight F₁ of the effective words of the sentence, where TF-IDF is term frequency-inverse document frequency and POS is part of speech; the coverage rate F₂ of topic words in the sentence; the number F₃ of topic words contained in the predicate of the sentence; the number F₄ of topic words contained in the general case of the sentence; and the number F₅ of effective words in the sentence that contain emotion words. The semantic weight value is calculated as follows:

In the formula, P_con(S) is the semantic weight of sentence S, and F_i and μ_i respectively denote the semantic feature values of the sentence and the weighting coefficients of those features.

Further, the association weight calculation comprises the steps of:

step S21: segmenting words using the sentence meaning structure and generating word vector representations, so as to obtain a representation vector for each sentence;

step S22: calculating the cosine similarity of the representation vectors of two sentences to obtain the similarity of the two sentences;

step S23: constructing a sentence graph model with the sentences of the document set as nodes, the relations between sentences as edges, and the similarities between sentences as edge weights, and obtaining the association weight value of a sentence from the semantic overlap of other sentences with that sentence.

Further, the association weight value R(S_k, S_j) of the sentence in step S23 is calculated as follows:

R(S_k, S_j) = P_con(S_j) * s(S_k, S_j);

in the formula, s(S_k, S_j) denotes the similarity of sentence S_j to sentence S_k, and P_con(S_j) denotes the semantic weight value of sentence S_j.

Further, extracting the view abstract in the topic by using the topic attribute, the semantic weight value and the associated weight value specifically comprises: the average sentence weight of each topic is obtained by weighting the semantic weight value and the association weight value, and finally 20 sentences with the highest score are selected as the viewpoint abstract.

The invention also provides a system for generating a viewpoint abstract from text data based on a sentence meaning structure model, comprising a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the method steps described above.

Compared with the prior art, the invention has the following beneficial effects: syntactically related words are extracted from the source text by an entity extraction method as keywords of the text subject; the emotion information about the effective words serving as evaluation objects in each sentence is analyzed using an emotion analysis method; and sentences are selected and combined into a viewpoint abstract by a selection method based on sentence importance, so that the emotion of the whole viewpoint abstract is as clear as possible and the abstract fits the text subject as closely as possible. This topic-attribute-based method for summarizing topic text solves problems found in current research methods based on topic attributes and their emotion information, obtains the abstract of a topic text efficiently and accurately, and can be applied to larger-scale data sets.

Drawings

FIG. 1 is a schematic diagram of the method of the embodiment of the present invention.

Detailed Description

The invention is further explained below with reference to the drawings and the embodiments.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit example embodiments according to the present application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

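As shown in FIG. 1, this embodiment provides a method for generating a viewpoint abstract from text data based on a sentence meaning structure model, which specifically comprises the following steps:

extracting a data set to be processed on a website, and preprocessing the data set;

constructing a topic corpus set and a background corpus set, and extracting topic attributes;

performing semantic weight calculation to obtain a semantic weight value of each sentence;

performing association weight calculation to obtain an association weight value of each sentence;

and extracting the viewpoint abstract in the topic by using the topic attributes, the semantic weight values, and the association weight values.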

In this embodiment, the data set to be processed on the website includes, but is not limited to, microblog data, website news data, and commodity comment data.

In this embodiment, the preprocessing specifically includes:

removing webpage links in the comment sentences, such as "http://t.cn/rcwwWYQZ";

removing comment sentences whose character length is less than 3, since such sentences carry too little information: most consist of emoticons and contain nothing else useful;

removing common irrelevant words in the comment sentences, such as "group pictures" and "original text forwarding";

converting all English text to lowercase.

To make the method more widely applicable, the raw data are cleaned and irrelevant text is filtered out, so that the topic attributes extracted by the topic attribute extraction method are more accurate. The method is therefore applicable not only to Chinese microblogs but also to website news and commodity comments.
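The four cleaning steps above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `preprocess`, the phrase list `IRRELEVANT_PHRASES`, and the URL pattern are assumptions, and the length check is applied after the other steps, an ordering the text does not specify.

```python
import re

# Hypothetical phrase list; the text only gives these two examples.
IRRELEVANT_PHRASES = ["group pictures", "original text forwarding"]

# Assumed pattern for web links such as short microblog URLs.
URL_PATTERN = re.compile(r"https?://\S+")


def preprocess(sentences, min_length=3):
    """Clean comment sentences and drop those that end up too short."""
    cleaned = []
    for sentence in sentences:
        text = URL_PATTERN.sub("", sentence)   # remove web page links
        text = text.lower()                    # lowercase all English text
        for phrase in IRRELEVANT_PHRASES:      # remove common irrelevant words
            text = text.replace(phrase, "")
        text = " ".join(text.split())          # collapse leftover whitespace
        if len(text) >= min_length:            # drop sentences shorter than 3 chars
            cleaned.append(text)
    return cleaned
```

For example, `preprocess(["Great phone http://t.cn/abc", "ok"])` keeps only the first sentence, with its link removed and its text lowercased.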

In this embodiment, constructing the topic corpus set and the background corpus set and extracting the topic attributes specifically includes: for the preprocessed text, setting the current topic text as the topic corpus and the texts of other topics as the background corpus; calculating the log-likelihood ratio of the words in the topic corpus by the log-likelihood ratio method; filtering the words with a preset threshold, the part of speech of a retained word being required to be noun, adjective, verb, or numeral; and thereby extracting the topic attributes of the topic corpus.
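A minimal sketch of the log-likelihood ratio filtering described above, assuming a binomial (Dunning-style) formulation of the statistic, since the text does not say which formulation is used. The tag names in `ALLOWED_POS`, the default threshold, and the extra condition that a word be relatively more frequent in the topic corpus than in the background corpus are illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical tag names for noun, adjective, verb, and numeral.
ALLOWED_POS = {"noun", "adj", "verb", "num"}


def _log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), with 0*log(0) = 0."""
    result = 0.0
    if k > 0:
        result += k * math.log(p)
    if n - k > 0:
        result += (n - k) * math.log(1.0 - p)
    return result


def llr(k1, n1, k2, n2):
    """Log-likelihood ratio for a word seen k1 times in n1 topic tokens
    and k2 times in n2 background tokens."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    return 2.0 * (_log_l(p1, k1, n1) + _log_l(p2, k2, n2)
                  - _log_l(p, k1, n1) - _log_l(p, k2, n2))


def extract_topic_attributes(topic_tokens, background_tokens, threshold=10.0):
    """Tokens are (word, pos) pairs from non-empty corpora.
    Returns the retained words, highest score first."""
    topic_counts = Counter(word for word, _ in topic_tokens)
    background_counts = Counter(word for word, _ in background_tokens)
    pos_of = dict(topic_tokens)
    n1, n2 = len(topic_tokens), len(background_tokens)
    scored = {}
    for word, k1 in topic_counts.items():
        if pos_of[word] not in ALLOWED_POS:
            continue                      # part-of-speech filter
        k2 = background_counts[word]
        score = llr(k1, n1, k2, n2)
        # Assumption: keep only words over-represented in the topic corpus.
        if score >= threshold and k1 / n1 > k2 / n2:
            scored[word] = score
    return sorted(scored, key=scored.get, reverse=True)
```

A word with the same relative frequency in both corpora scores zero and is filtered out, while a word that is frequent in the topic corpus and rare in the background corpus scores high and is kept as a topic attribute.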

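In this embodiment, the semantic weight calculation includes the following steps:

step S11: calculating the emotion score of each sentence as the emotion feature by using an emotion analysis method based on an emotion dictionary;

step S12: extracting lexical features by using a semantic word extraction method based on a semantic dictionary;

step S13: analyzing the sentences by using BFS-CSA to obtain the sentence meaning structure features;

step S14: calculating the semantic weight of each sentence.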

Preferably, the current topic text is set as the topic corpus and the texts of other topics as the background corpus. A positive emotion attribute set and a negative emotion attribute set are obtained from emotion dictionaries, such as a positive emotion word dictionary, a negative emotion word dictionary, and dictionaries for the various parts of speech, and the emotion score of each sentence is calculated by the dictionary-based emotion analysis method to serve as the emotion attribute feature. The emotion vocabulary ontology covers 7 part-of-speech types: nouns (noun), verbs (verb), adjectives (adj), adverbs (adv), network words (nw), idioms (idiom), and prepositional phrases (prep). Lexical features are obtained in two steps: first, a part-of-speech set is obtained by the semantic word extraction method based on the semantic dictionary; second, the part-of-speech set is matched against the semantic words in the sentences. The sentences are then analyzed with BFS-CSA to obtain the sentence meaning structure.
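As a rough illustration of dictionary-based emotion scoring, the sketch below counts matches against a positive and a negative word set. The word sets, the function name `emotion_score`, and the normalized-difference score are assumptions; the text does not give the scoring formula or the dictionary contents.

```python
# Hypothetical dictionary entries, for illustration only.
POSITIVE_WORDS = {"good", "great", "excellent"}
NEGATIVE_WORDS = {"bad", "poor", "terrible"}


def emotion_score(tokens):
    """Return (positive count - negative count) / token count, in [-1, 1].

    The normalized difference is an assumed scoring rule.
    """
    if not tokens:
        return 0.0
    positive = sum(1 for token in tokens if token in POSITIVE_WORDS)
    negative = sum(1 for token in tokens if token in NEGATIVE_WORDS)
    return (positive - negative) / len(tokens)
```

A sentence with one positive and one negative word scores 0.0 under this rule, while a sentence whose only emotion words are positive scores above zero.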

In this embodiment, step S14 specifically includes: the semantic weight of the sentence is calculated using the sentence meaning structure feature F₆ and the lexical features. The lexical features are of 5 types: the average TF-IDF-POS word weight F₁ of the effective words of the sentence, where TF-IDF is term frequency-inverse document frequency and POS is part of speech; the coverage rate F₂ of topic words in the sentence; the number F₃ of topic words contained in the predicate of the sentence; the number F₄ of topic words contained in the general case of the sentence; and the number F₅ of effective words in the sentence that contain emotion words. The semantic weight value is calculated as follows:

In the formula, P_con(S) is the semantic weight of sentence S, and F_i and μ_i respectively denote the semantic feature values of the sentence and the weighting coefficients of those features. The semantic weight value is obtained from the feature values and the feature weighting coefficients.
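The formula itself does not appear in this text. One plausible reading, given that P_con(S) is obtained from the feature values F_i and their weighting coefficients μ_i, is a weighted sum over the six features. The sketch below assumes that form; the function name `semantic_weight` and the example coefficients are hypothetical.

```python
def semantic_weight(features, coefficients):
    """Assumed form: P_con(S) = sum of mu_i * F_i over the six features.

    features     -- feature values F_1..F_6 of one sentence
    coefficients -- weighting coefficients mu_1..mu_6
    """
    if len(features) != len(coefficients):
        raise ValueError("features and coefficients must have the same length")
    return sum(mu * f for mu, f in zip(coefficients, features))
```

With equal coefficients of 0.5, `semantic_weight([1, 2, 3, 4, 5, 6], [0.5] * 6)` gives 10.5, so every feature contributes in proportion to its value.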

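Here, F₁ and F₂ are statistical features of the effective words of a sentence. Nouns (noun) and verbs (verb) are generally considered more important in a sentence than other parts of speech, so they are given a weight of 2 and all other parts of speech a weight of 1. Predicates, adverbs, and the like are the core content of a sentence; if these components of a sentence appear in the topic word list, the sentence is more closely related to the topic and better expresses the central meaning of the topic. The topic words contained in the general case are selected as a feature to supplement the predicate.

In this embodiment, the association weight calculation includes the following steps:

step S21: segmenting words using the sentence meaning structure and generating word vector representations, so as to obtain a representation vector for each sentence;

step S22: calculating the cosine similarity of the representation vectors of two sentences to obtain the similarity of the two sentences;

step S23: constructing a sentence graph model with the sentences of the document set as nodes, the relations between sentences as edges, and the similarities between sentences as edge weights, and obtaining the association weight value of a sentence from the semantic overlap of other sentences with that sentence.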

Preferably, an n-dimensional space vector V(S_k) = {ω_{k,1}, ω_{k,2}, …, ω_{k,n}} of the sentence is constructed from the obtained sentence similarities, and a graph model is constructed. Each node S in the graph corresponds to a sentence. The degree d of node S is the number of edges connected to S and reflects the importance of the information S contains: the larger d is, the more sentences are related to this sentence and the more important the information it contains.
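The similarity computation and the node degree described above can be sketched as follows. The cosine similarity is the standard definition; the `min_similarity` cutoff that decides whether two sentences are joined by an edge is an assumption, since the text does not say when an edge exists.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_sentence_graph(vectors, min_similarity=0.1):
    """Return (edges, degree) for the sentence graph.

    edges  -- {(k, j): similarity} for k < j, weighted by cosine similarity
    degree -- degree[k] is the number of edges touching sentence k
    """
    edges = {}
    degree = [0] * len(vectors)
    for k in range(len(vectors)):
        for j in range(k + 1, len(vectors)):
            similarity = cosine_similarity(vectors[k], vectors[j])
            if similarity > min_similarity:   # assumed edge criterion
                edges[(k, j)] = similarity
                degree[k] += 1
                degree[j] += 1
    return edges, degree
```

For the vectors `[[1, 0], [1, 1], [0, 1]]`, the middle sentence is connected to both others and has degree 2, which marks it as the most central sentence in this small graph.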

In this embodiment, the association weight value R(S_k, S_j) of the sentence in step S23 is calculated as follows:

R(S_k, S_j) = P_con(S_j) * s(S_k, S_j);

in the formula, s(S_k, S_j) denotes the similarity of sentence S_j to sentence S_k, and P_con(S_j) denotes the semantic weight value of sentence S_j.
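A short sketch of the formula above. `pairwise_association` follows R(S_k, S_j) = P_con(S_j) * s(S_k, S_j) directly; `association_weight`, which sums the pairwise values over all other sentences to get one value per sentence, is an assumed aggregation that the text does not spell out.

```python
def pairwise_association(p_con, similarity, k, j):
    """R(S_k, S_j) = P_con(S_j) * s(S_k, S_j)."""
    return p_con[j] * similarity[k][j]


def association_weight(p_con, similarity, k):
    """Assumed aggregate: sum of R(S_k, S_j) over every other sentence j.

    p_con      -- list of semantic weight values, one per sentence
    similarity -- square matrix, similarity[k][j] = s(S_k, S_j)
    """
    return sum(pairwise_association(p_con, similarity, k, j)
               for j in range(len(p_con)) if j != k)
```

Under this reading, a sentence gains association weight when it is similar to sentences that themselves carry high semantic weight.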

In this embodiment, extracting the view abstract in the topic by using the topic attribute, the semantic weight value, and the association weight value specifically includes: the average sentence weight of each topic is obtained by weighting the semantic weight value and the association weight value, and finally 20 sentences with the highest score are selected as the viewpoint abstract.

Specifically, sentences are ordered from high to low by importance to determine the extraction order, and sentences are extracted according to their importance and redundancy. The average weight of a sentence is obtained from the semantic weight value and the association weight value, and relevance is ensured by means of the topic attributes. The importance of a sentence depends on two factors: 1) the more topic attributes the sentence contains, the more important it is;

2) the larger the average weight of the sentence, the more important it is. A scoring function is constructed from these two factors with a weight θ balancing them; the two are generally considered equally important, so θ = 0.5. The 20 sentences ranked highest by sentence importance score are taken as the viewpoint abstract and form the viewpoint abstract sentence set S.
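The selection step can be sketched as below. The linear combination of the two factors with weight θ is an assumption about the form of the scoring function, which the text does not give, and redundancy handling is left out of this sketch. The names `importance` and `select_summary` are hypothetical.

```python
def importance(topic_attribute_count, average_weight, theta=0.5):
    """Assumed scoring function: theta balances the two importance factors."""
    return theta * topic_attribute_count + (1.0 - theta) * average_weight


def select_summary(sentences, topic_attributes, average_weights,
                   top_n=20, theta=0.5):
    """Return the top_n sentences by importance score, highest first.

    sentences        -- list of token lists
    topic_attributes -- set of extracted topic attribute words
    average_weights  -- one average weight per sentence
    """
    scored = []
    for tokens, weight in zip(sentences, average_weights):
        # Count the distinct topic attributes the sentence contains.
        count = sum(1 for token in set(tokens) if token in topic_attributes)
        scored.append((importance(count, weight, theta), tokens))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [tokens for _, tokens in scored[:top_n]]
```

With θ = 0.5, a sentence containing two topic attributes and an average weight of 0.4 scores 1.2 and outranks a sentence with no topic attributes and an average weight of 0.9, which scores 0.45.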

This embodiment also provides a system for generating a viewpoint abstract from text data based on a sentence meaning structure model, comprising a memory, a processor, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the method steps described above.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The foregoing is directed to preferred embodiments of the present invention; other and further embodiments may be devised without departing from its basic scope, which is determined by the claims that follow. Any simple modification, equivalent change, or adaptation of the above embodiments made according to the technical essence of the present invention falls within the protection scope of the technical solution of the present invention.

Claims

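1. A method for generating a viewpoint abstract from text data based on a sentence meaning structure model, characterized by comprising the following steps:

extracting a data set to be processed on a website, and preprocessing the data set;

constructing a topic corpus set and a background corpus set, and extracting topic attributes;

performing semantic weight calculation to obtain a semantic weight value of each sentence;

performing association weight calculation to obtain an association weight value of each sentence;

extracting the viewpoint abstract in the topic by using the topic attributes, the semantic weight values, and the association weight values;

the semantic weight calculation comprises the following steps:

step S11: calculating the emotion score of each sentence as the emotion feature by using an emotion analysis method based on an emotion dictionary;

step S12: extracting lexical features by using a semantic word extraction method based on a semantic dictionary;

step S13: analyzing the sentences by using BFS-CSA to obtain the sentence meaning structure features;

step S14: calculating the semantic weight of each sentence;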

step S14 specifically includes: calculating the semantic weight of the sentence using the sentence meaning structure feature F₆ and the lexical features, wherein the lexical features are of 5 types: the average TF-IDF-POS word weight F₁ of the effective words of the sentence, where TF-IDF is term frequency-inverse document frequency and POS is part of speech; the coverage rate F₂ of topic words in the sentence; the number F₃ of topic words contained in the predicate of the sentence; the number F₄ of topic words contained in the general case of the sentence; and the number F₅ of effective words in the sentence that contain emotion words; the semantic weight value is calculated as follows:

in the formula, P_con(S) is the semantic weight of sentence S, and F_i and μ_i respectively denote the semantic feature values of the sentence and the weighting coefficients of those features;

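the association weight calculation comprises the following steps:

step S21: segmenting words using the sentence meaning structure and generating word vector representations, so as to obtain a representation vector for each sentence;

step S22: calculating the cosine similarity of the representation vectors of two sentences to obtain the similarity of the two sentences;

step S23: constructing a sentence graph model with the sentences of the document set as nodes, the relations between sentences as edges, and the similarities between sentences as edge weights, and obtaining the association weight value of a sentence from the semantic overlap of other sentences with that sentence;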

the association weight value R(S_k, S_j) of the sentence in step S23 is calculated as follows:

R(S_k, S_j) = P_con(S_j) * s(S_k, S_j);

in the formula, s(S_k, S_j) denotes the similarity of sentence S_j to sentence S_k, and P_con(S_j) denotes the semantic weight value of sentence S_j;

the method for extracting the viewpoint abstract in the topic by utilizing the topic attribute, the semantic weight value and the associated weight value specifically comprises the following steps: the average sentence weight of each topic is obtained by weighting the semantic weight value and the association weight value, and finally 20 sentences with the highest score are selected as the viewpoint abstract.

2. The method for generating the opinion summary based on the text data of the sentence meaning structure model as claimed in claim 1, wherein the data sets to be processed on the website include microblog data, website news data and commodity comment data.

3. The method for generating a viewpoint summary of text data based on sentence meaning structure model as claimed in claim 1, wherein the preprocessing is specifically:

removing the webpage links in the comment sentences;

removing the comment sentences with the character length smaller than 3;

removing common irrelevant words in the comment sentences;

converting all English text to lowercase.

4. The method for generating a viewpoint summary of text data based on a sentence meaning structure model as claimed in claim 1, wherein constructing the topic corpus set and the background corpus set and extracting the topic attributes specifically comprises: for the preprocessed text, setting the current topic text as the topic corpus and the texts of other topics as the background corpus; calculating the log-likelihood ratio of the words in the topic corpus by the log-likelihood ratio method; filtering the words with a preset threshold, the part of speech of a retained word being required to be noun, adjective, verb, or numeral; and thereby extracting the topic attributes of the topic corpus.

5. A system for generating a point of view summary of text data based on a sentence meaning model, comprising a memory, a processor and a computer program stored on the memory and executable by the processor, the computer program when executed by the processor implementing the method steps of any one of claims 1 to 4.