CN111259136A - Method for automatically generating theme evaluation abstract based on user preference - Google Patents

Method for automatically generating theme evaluation abstract based on user preference

Info

Publication number
CN111259136A
Authority
CN
China
Prior art keywords
word
feature
characteristic
document
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010022473.7A
Other languages
Chinese (zh)
Other versions
CN111259136B (en)
Inventor
何为
刘楠
马文鹏
李银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinyang Normal University
Original Assignee
Xinyang Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinyang Normal University filed Critical Xinyang Normal University
Priority to CN202010022473.7A priority Critical patent/CN111259136B/en
Publication of CN111259136A publication Critical patent/CN111259136A/en
Application granted granted Critical
Publication of CN111259136B publication Critical patent/CN111259136B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Abstract

The invention discloses a method for automatically generating a topic evaluation abstract based on user preference. Past evaluation information of a customer is collected, and sample evaluation documents from a field of interest provided by the customer are analyzed; using the semantic relations implied by the co-occurrence of word pairs in the sample documents, a feature database of co-occurrence word pairs is established by calculating word-pair co-occurrence rates. Using this feature database, feature word chains are clustered and similarities are calculated for the target text, so as to provide a query-based automatic abstract matching the customer's interests. The method is not limited to the field of catering evaluation and can also be used for recommendations in other activities such as online shopping and travel, and for information retrieval in public and professional fields.

Description

Method for automatically generating theme evaluation abstract based on user preference
Technical Field
The invention relates to automatic summarization technology in the natural language processing direction of the computer research field, and in particular to a method for automatically generating a topic evaluation abstract based on user preference.
Background
With the continuous development of the internet and information technology, research on automatic summarization has entered an unprecedented stage of prosperity. By intended user, automatic summarization can be divided into generic summarization (Generic Summarization) and query-based summarization (Query-based Summarization). Query-based automatic summarization, which provides an abstract with a corresponding emphasis according to the needs or interests of the user, is also called user-focused summarization (User-focused Summarization), topic-focused summarization (Topic-focused Summarization) or query-focused summarization (Query-focused Summarization); compared with a generic abstract, which covers the overall main content of the full text, a query-based automatic abstract better reflects the individual requirements of the user.
Research on query-based automatic summarization began abroad in the 1980s. It is an important component of automatic summarization, and since query-based and generic automatic summarization face similar limits on the size of the result, most methods extract semantically similar sentences to form an abstract of the related topic. An article by V. Plachouras and I. Ounis (published in Information Retrieval, 2005, Vol. 8, No. 2) studies methods for improving query accuracy in web pages using automatic summarization. The article "A task-oriented study on the influencing effects of query-biased summarisation in web searching" (by R. White, J. Jose and I. Ruthven, published in Information Processing & Management, 2003, Vol. 39, No. 5) proposes generating query-based automatic abstracts according to the frequency of the query words in each sentence of a web page and the text style. An article on the application of the keyword density distribution method to summarization (authors Yan, Lin Hongfei, Yang and Zhao, published in Computer Engineering, 2007, Vol. 33, No. 6) adopts a keyword density algorithm to generate query-based automatic abstracts.
The key step of query-based automatic summarization is obtaining the biased topics of the query, and the common strategy for this is to expand the conceptual semantics of the user's query. Some scholars use the frequency and position of the user's query keywords in the article to obtain weights, but this method can only obtain results mechanically and cannot meet semantic requirements: the true interest behind a user's query is difficult to define accurately from the query terms alone. The ideal approach is to expand the user's query words with general-purpose semantic resources, but no suitable general-purpose semantic resource currently exists, so some scholars obtain the bias from existing semantic libraries, using, for example, WordNet for English, or HowNet and the synonym forest for Chinese, as the semantic resource library. The article "A Sentence Selection Method for Query-Based Chinese Multi-Document Summarization" (by X. Song, J. Huang, J. Zhou and H. Zhang, published in Proceedings of PACIIA 2009) calculates the similarity between feature words using HowNet and uses it to guide the selection of abstract sentences for query-based automatic summarization. Such methods, however, are easily limited by the scale and update speed of the semantic library, and can hardly cope with the flood of newly coined words on the internet.
With the wide application of the mobile internet, checking other people's evaluations of a restaurant online when choosing a dining place, and publishing one's own evaluation after the meal, has become a popular lifestyle among young people. Existing catering evaluation websites such as Dianping and Meituan have collected a large amount of evaluation information from diners, which has become a basis for others when choosing where to eat. However, when choosing a restaurant, a user has to check and read many comments one by one to find out whether they match his or her own needs. Some websites ask users to give a score when submitting an evaluation and use the collected scores as a recommendation index, but users differ in their individual needs, consumption habits and tastes, and such a simple score cannot serve as a basis for a user to select a restaurant suited to him or her.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for automatically generating a topic evaluation abstract based on user preference, for practical application scenarios in which a user selects restaurants, hotels, travel destinations and the like according to evaluations.
The technical scheme adopted for realizing the aim of the invention is as follows:
a method for automatically generating a theme evaluation abstract based on user preference comprises the following steps:
step 101, collecting an evaluation text published on the internet by a user based on a specific scene in the past as a sample document;
step 102, preprocessing a sample document;
step 103, calculating word pair co-occurrence rate from the preprocessed sample document;
step 104, storing the word pair co-occurrence rate into a feature database;
step 105, when past evaluation texts of the user are lacking or the user has other preference requirements, letting the user manually input several feature keywords, and storing these keywords in the feature database as pairwise associated co-occurrence word pairs;
step 106, collecting the evaluation texts of other corresponding users in the specific scene, within the user's selection range, and summarizing them respectively to generate target documents;
step 107, preprocessing the target document;
step 108, extracting a feature word set of the target document, looking up the association degree of each word in the feature word set from the feature database, and generating feature word chains;
step 109, dividing a single sentence from the target document;
step 110, calculating the similarity between the feature words contained in each single sentence of the target document and the feature word chains, and, according to the similarity between each single sentence and each feature word chain, selecting in turn the single sentence with the highest similarity to each feature word chain to generate the abstract.
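Taken together, steps 101-110 can be sketched as a minimal pipeline. All function and variable names below are illustrative (not from the patent), and preprocessing is reduced to whitespace tokenization with a toy stop-word list, whereas a real system would use a Chinese word segmenter.

```python
# Minimal sketch of steps 101-110; names and stop words are illustrative.
STOP_WORDS = {"the", "a", "is", "and", "was"}

def preprocess(document: str) -> list[list[str]]:
    """Split a document into paragraphs (window units) of content words."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return [[w for w in p.split() if w.lower() not in STOP_WORDS]
            for p in paragraphs]

def generate_summary(sample_docs, target_doc, build_db, build_chains, pick_sentences):
    """Steps 102-110 wired together; the three callables stand in for the
    co-occurrence feature database, word-chain clustering and similarity-based
    sentence selection described below."""
    db = build_db([preprocess(d) for d in sample_docs])       # steps 102-104
    chains = build_chains(db, preprocess(target_doc))         # steps 107-108
    sentences = [s.strip()                                    # step 109
                 for s in target_doc.replace("\n", " ").split(".") if s.strip()]
    return pick_sentences(sentences, chains)                  # step 110
```

The three callables stand in for the components that the following paragraphs describe in detail.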
Preferably, in step 103, the word-pair co-occurrence rate in a single sample document d is calculated as follows:
For any two words w_i and w_j in the sample document d, the word-pair co-occurrence rate P_d(w_i, w_j) is calculated by the following formula:
[Formula image in the original: the word-pair co-occurrence rate P_d(w_i, w_j), computed from S_d(w_i), S_d(w_j), S_d(w_i, w_j) and N_t(w_i, w_j) as defined below.]
wherein T is the set of window units in the sample document d; W = {w_1, w_2, …, w_n} is the word set of the sample document d; w_i and w_j are any two words in the word set W, and i, j and n are positive integers; S_d(w_i) denotes the number of window units in the sample document d that contain w_i; S_d(w_j) denotes the number of window units that contain w_j; S_d(w_i, w_j) denotes the number of window units in the sample document d that contain both w_i and w_j; N_t(w_i, w_j) denotes the number of co-occurrences of w_i and w_j in a given window unit t, t ∈ T; when the numbers of occurrences of w_i and w_j in the same paragraph differ, the smaller of the two occurrence counts is taken as the co-occurrence count.
Preferably, in step 103, the total word-pair co-occurrence rate over k sample documents {d_1, d_2, …, d_k} is calculated as follows:
P_k(w_i, w_j) = [P_(k-1)(w_i, w_j) × S_(k-1)(w_i, w_j) + P_dk(w_i, w_j) × S_dk(w_i, w_j)] / [S_(k-1)(w_i, w_j) + S_dk(w_i, w_j)]
wherein k is a positive integer greater than 1; P_(k-1)(w_i, w_j) is the total word-pair co-occurrence rate of w_i and w_j in the first k-1 sample documents {d_1, d_2, …, d_(k-1)}; S_(k-1)(w_i, w_j) is the total number of window units in the first k-1 sample documents {d_1, d_2, …, d_(k-1)} in which the word pair (w_i, w_j) co-occurs; P_dk(w_i, w_j) is the word-pair co-occurrence rate of w_i and w_j in the kth sample document d_k; S_dk(w_i, w_j) denotes the number of window units in the kth sample document d_k that contain both w_i and w_j; D is the sample document space {d_1, d_2, …, d_k}.
Preferably, the feature database is composed of feature words and co-occurrence word pairs.
Preferably, looking up the association degree of each word in the feature word set from the feature database and generating feature word chains comprises the following steps for a target document d whose abstract is to be generated:
The feature database stores a feature word set V = {v_1, v_2, …, v_m} and a co-occurrence word pair set E = {e(v_i, v_j)}; any co-occurrence word pair e(v_i, v_j) in E is a Word Co-occurrence structure comprising the numbers, the co-occurrence rate and the co-occurrence count of the two words;
The target document d contains a feature word set W = {w_1, w_2, …, w_n}. For any word pair (w_i, w_j) in W, if w_i ∈ V, w_j ∈ V, and the word pair is contained in E, i.e. there is an e(w_i, w_j), then the association degree l_ij of these two words is the co-occurrence rate of e(w_i, w_j); if w_i or w_j is not contained in V, the association degree of the two words is computed from the target document itself: l_ij = P_d(w_i, w_j). In this way, the association degree set L = {l_11, l_12, …, l_1n, l_21, … l_2n, …, l_(n-1)n} of all word pairs in the feature word set W of the target document d is generated;
According to the association degrees between the word pairs, a number of feature word chains are constructed by a clustering method, each chain showing one related topic of the target document d; the feature word chains are gathered into a feature word chain set C in the order in which they are generated during clustering.
Preferably, the similarity between the feature words contained in a single sentence and a feature word chain is calculated as follows:
The target document d comprises a single-sentence set S and a feature word set W = {w_1, w_2, …, w_n}. According to whether a single sentence s_q in S contains each feature word of W, the relation of s_q to the feature words in W is obtained as the indicator vector
x_q = (x_q1, x_q2, …, x_qn)
wherein x_qi = 1 if s_q contains w_i, and x_qi = 0 otherwise.
The target document d comprises a feature word chain set C and the feature word set W = {w_1, w_2, …, w_n}. According to whether a word chain c_p in C contains each feature word of W, the relation of c_p to the feature words in W is obtained as the indicator vector
y_p = (y_p1, y_p2, …, y_pn)
wherein y_pi = 1 if c_p contains w_i, and y_pi = 0 otherwise.
The similarity of a single sentence s_q and a word chain c_p is then the cosine of the two indicator vectors:
Sim(s_q, c_p) = Σ_(i=1..n) x_qi·y_pi / (sqrt(Σ_(i=1..n) x_qi^2) × sqrt(Σ_(i=1..n) y_pi^2))
the invention has the beneficial effects that:
the method overcomes the defect that the theme which is interested by the user is difficult to be described when the query type automatic abstract is obtained by expanding the query words provided by the user in the prior art. By extracting word pairs in the subject text, the implicit connection between the characteristic words of the subject text can be found, and the content really interested by the user can be extracted from the target text to generate the query type automatic abstract. By means of the overall clustering of the evaluation texts, the method can avoid the deviation of subjective opinions of individual users, and enables the conclusion to be more objective and fair.
The invention has the effect that the semantic relation implied by the co-occurrence relation of the word pairs in the sample document is utilized, the co-occurrence frequency and the co-occurrence position of the word pairs contain the implicit relation between the word pairs and the theme, and the more the number of the occurrence times of a certain word pair in the window unit is, the closer relation of the word pair in the theme is shown. By calculating the co-occurrence rate of the word pairs in the single text and the plurality of texts, the generated co-occurrence word pair has the characteristics of expandability and updatability for the theme feature library, and the interesting content of the user can be more obviously embodied along with the increase of the number of the theme texts.
The invention discloses a method for automatically generating an evaluation abstract of a restaurant by analyzing the existing evaluation text information of other users of the restaurant according to the interest preference of the past evaluation display of the user. The method is characterized in that an incidence relation in user evaluation is mined in a co-occurrence word pair mode and is used as a basis for selecting abstract sentences. The method is not limited to the field of catering evaluation, and is also suitable for recommending other activities such as online shopping consumption, travel, accommodation and the like.
Drawings
FIG. 1 is a flowchart of the method of embodiment 1 of the present invention.
FIG. 2 is a flowchart of a method of embodiment 2 of the present invention.
Detailed Description
The invention is further described in detail below with reference to the drawings and the detailed description.
Co-occurrence analysis of words is one of the successful applications of natural language processing technology in information retrieval; its core idea is that the co-occurrence frequency between words reflects, to some extent, the semantic association between them. Co-occurrence research rests on the following assumption: in a large-scale text corpus, two words are considered semantically related if they frequently appear together in the same window unit (e.g., a document, a natural paragraph, a sentence), and the more frequently they co-occur, the closer the semantics of the two words.
When the co-occurrence rate of a feature word pair exceeds a certain threshold, the pair is related to at least one topic. Since a document may consist of one or more topics, topics can be divided by window units. Following the way people usually organize language, most articles have a clear theme and a compact structure, and a topic is usually made up of one or more natural paragraphs, so window units are divided along natural paragraphs.
Two words appearing within a certain distance of each other in an article form a co-occurrence word pair. The co-occurrence frequency and position of a word pair carry an implicit relation between the pair and the topic: the more often a word pair appears within a window unit, the more closely the pair is related to the topic; such a pair may be two semantically related words, a fixed collocation, and so on. Two words that appear only in different window units, by contrast, do not necessarily contribute much to a common topic, since they may belong to different topics.
The method uses the semantic relations implied by the co-occurrence of word pairs in the sample documents and, by calculating word-pair co-occurrence rates, establishes a topic feature library of co-occurrence word pairs, realizing an extensible, updatable semantic resource library oriented to the user's field of interest. Using this topic feature library, a query-based automatic abstract extraction method is designed through the clustering of feature word chains and the calculation of similarity. The method restricts window units to paragraphs and achieves a good abstracting effect on articles with a traditional chapter structure.
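Since window units are divided along natural paragraphs, the division can be sketched as follows. This assumes paragraphs are separated by blank lines, which is an illustrative assumption; the patent does not fix a text format.

```python
import re

def window_units(text: str) -> list[str]:
    """Split a document into window units, one per natural paragraph.
    Paragraphs are assumed to be separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```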
Embodiment 1:
As shown in FIG. 1, this embodiment provides a method for automatically generating a topic evaluation abstract based on user preference, comprising the following steps:
step S1, collecting the evaluation texts published on the network by the user in the specific scene as sample documents, as the basis for mining the user's preferences; the specific scenes comprise restaurants, lodging, tourism, online shopping and the like;
step S2, preprocessing the sample document, including but not limited to word segmentation and stop word removal;
step S3, calculating word pair co-occurrence rate from the preprocessed sample document;
step S4, storing the word pair co-occurrence rate into a feature database;
step S5, when the user lacks the past evaluation text or the user has other preference requirements, the user can manually input a plurality of feature keywords, and the keywords are used as co-occurrence word pairs which are associated with each other and are stored in a feature database;
step S6, collecting other customer evaluation texts corresponding to specific scenes in a user selection range (such as a closer geographical position, a proper price range and the like), and summarizing to generate a target document; for example, the evaluation texts of other customers such as restaurants and hotels are collected and generated into target documents by taking the restaurants and hotels as units;
step S7, preprocessing the target document, including but not limited to word segmentation, stop word removal and other methods;
step S8, extracting a feature word set of the target document, looking up the association degree of each word in the feature word set from the feature database, and generating feature word chains;
step S9, dividing a single sentence from the target document;
step S10, calculating the similarity between the feature words contained in each single sentence of the target document and the feature word chains, and selecting in turn the single sentence with the highest similarity to each feature word chain, subject to the abstract size limit, to generate the abstract.
In one embodiment, in step S3, the word-pair co-occurrence rate in a single sample document d is calculated as follows:
For any two words w_i and w_j in the sample document d, the word-pair co-occurrence rate P_d(w_i, w_j) is calculated by the following formula:
[Formula image in the original: the word-pair co-occurrence rate P_d(w_i, w_j), computed from S_d(w_i), S_d(w_j), S_d(w_i, w_j) and N_t(w_i, w_j) as defined below.]
wherein T is the set of window units in the sample document d; W = {w_1, w_2, …, w_n} is the word set of the sample document d; w_i and w_j are any two words in the word set W, and i, j and n are positive integers; S_d(w_i) denotes the number of window units in the sample document d that contain w_i; S_d(w_j) denotes the number of window units that contain w_j; S_d(w_i, w_j) denotes the number of window units in the sample document d that contain both w_i and w_j; N_t(w_i, w_j) denotes the number of co-occurrences of w_i and w_j in a given window unit t, t ∈ T; when the numbers of occurrences of w_i and w_j in the same paragraph differ, the smaller of the two occurrence counts is taken as the co-occurrence count.
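The window-unit counts that the co-occurrence rate is defined over can be sketched as follows; the formula itself appears only as an image in the original, so only the counts S_d and N_t are implemented, and all names are illustrative. Note the rule from the text: when w_i and w_j occur a different number of times in the same paragraph, the smaller count is taken as the co-occurrence count.

```python
def S_d(doc, *words):
    """Number of window units of a document that contain every given word.
    A document is a list of window units (natural paragraphs), each a
    list of words."""
    return sum(1 for unit in doc if all(w in unit for w in words))

def N_t(unit, wi, wj):
    """Co-occurrence count of wi and wj inside one window unit t: the
    minimum of the two occurrence counts, per the patent text."""
    return min(unit.count(wi), unit.count(wj))

# A toy document with three window units.
doc = [["food", "good", "food"],
       ["service", "slow"],
       ["food", "service", "food", "service"]]
```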
In one embodiment, in step S3, the total word-pair co-occurrence rate over k sample documents {d_1, d_2, …, d_k} is calculated as follows:
P_k(w_i, w_j) = [P_(k-1)(w_i, w_j) × S_(k-1)(w_i, w_j) + P_dk(w_i, w_j) × S_dk(w_i, w_j)] / [S_(k-1)(w_i, w_j) + S_dk(w_i, w_j)]
wherein k is a positive integer greater than 1; P_(k-1)(w_i, w_j) is the total word-pair co-occurrence rate of w_i and w_j in the first k-1 sample documents {d_1, d_2, …, d_(k-1)}; S_(k-1)(w_i, w_j) is the total number of window units in the first k-1 sample documents {d_1, d_2, …, d_(k-1)} in which the word pair (w_i, w_j) co-occurs; P_dk(w_i, w_j) is the word-pair co-occurrence rate of w_i and w_j in the kth sample document d_k; S_dk(w_i, w_j) denotes the number of window units in the kth sample document d_k that contain both w_i and w_j; D is the sample document space {d_1, d_2, …, d_k}.
For example, the 1 st sample document d is first calculated1Chinese vocabulary wiAnd wjAnd simultaneously contains wiAnd wjThe number of window units of (1) is marked as P1(wi,wj) And S1(wi,wj) (ii) a Then calculate the vocabulary w in the 2 nd documentiAnd wjWord pair co-occurrence rate of
Figure BDA0002361299250000095
And simultaneously contains wiAnd wjNumber of window units
Figure BDA0002361299250000096
The vocabulary w in the two documentsiAnd wjThe total co-occurrence rate of the word pairs and the total window unit number of co-occurrence are respectively as follows:
Figure BDA0002361299250000101
Figure BDA0002361299250000102
and repeating the steps until the vocabulary w in the k sample documents is calculatediAnd wjWord pair total co-occurrence rate.
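The incremental computation above amounts to a running weighted average, with each document's pair co-occurrence rate weighted by its number of co-occurring window units. A sketch (function name illustrative):

```python
def update_total_cooccurrence(P_prev, S_prev, P_dk, S_dk):
    """Fold sample document k into the running totals:
    P_k = (P_{k-1} * S_{k-1} + P_dk * S_dk) / (S_{k-1} + S_dk)
    S_k = S_{k-1} + S_dk
    starting from P_0 = 0, S_0 = 0."""
    S_k = S_prev + S_dk
    if S_k == 0:
        return 0.0, 0   # the pair has not co-occurred anywhere yet
    return (P_prev * S_prev + P_dk * S_dk) / S_k, S_k
```

Folding in, say, (P_1, S_1) = (0.5, 2) and then (P_d2, S_d2) = (0.25, 4) yields the two-document totals P_2 and S_2 of the example above.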
In one embodiment, the feature database is constructed with the following structure:
The feature database consists of feature words and co-occurrence word pairs, for which a feature word type and a co-occurrence word pair type are designed.
Focus Word is the data structure of a feature word:
[Structure image in the original: a Focus Word record comprises the unique number and the name of a feature word.]
Word Co-occurrence is the data structure of a co-occurrence word pair, consisting of the number IDone of the first word, the number IDtwo of the second word, the co-occurrence rate value and the co-occurrence count CoNum. To reduce storage space, a threshold α can be set during storage: the parameters of a word pair are stored in the feature database only when its co-occurrence rate is higher than α.
The specific operation is as follows. Assume a sample document space contains several sample documents. One sample document is selected at a time, and the co-occurrence rate of every word pair is calculated by the method for the word-pair co-occurrence rate in a single sample document d. When a word pair whose co-occurrence rate is higher than the threshold α is found, the names of its two words are stored as Focus Words, each given a unique number, and the numbers, co-occurrence rate and co-occurrence count of the two words are stored in a Word Co-occurrence.
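The two record types and the threshold test can be sketched as follows. The Python field and class names are illustrative renderings of the Focus Word and Word Co-occurrence structures described above, and α is a free parameter.

```python
from dataclasses import dataclass

@dataclass
class FocusWord:
    id: int       # unique number of the feature word
    name: str     # the word itself

@dataclass
class WordCooccurrence:
    id_one: int   # number of the first word (IDone)
    id_two: int   # number of the second word (IDtwo)
    rate: float   # co-occurrence rate value
    count: int    # co-occurrence count (CoNum)

class FeatureDB:
    """Stores a word pair only when its co-occurrence rate exceeds alpha."""

    def __init__(self, alpha: float):
        self.alpha = alpha
        self.words: dict[str, FocusWord] = {}
        self.pairs: list[WordCooccurrence] = []

    def _word_id(self, name: str) -> int:
        # Assign each stored feature word a unique sequential number.
        if name not in self.words:
            self.words[name] = FocusWord(id=len(self.words), name=name)
        return self.words[name].id

    def add_pair(self, w1: str, w2: str, rate: float, count: int) -> bool:
        if rate <= self.alpha:
            return False   # below threshold: not stored, saving space
        self.pairs.append(
            WordCooccurrence(self._word_id(w1), self._word_id(w2), rate, count))
        return True
```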
In one embodiment, in step S8, extracting the feature word set of the target document, looking up the association degree of each word in the feature word set from the feature database, and generating feature word chains specifically comprises:
For a target document d whose abstract is to be generated, the association degree of each word in the feature word set is looked up from the feature database and from the target document d itself, and feature word chains are generated, as follows:
The feature database stores a feature word set V = {v_1, v_2, …, v_m} and a co-occurrence word pair set E = {e(v_i, v_j)}; any co-occurrence word pair e(v_i, v_j) in E is a Word Co-occurrence structure comprising the numbers, the co-occurrence rate and the co-occurrence count of the two words. The target document d contains a feature word set W = {w_1, w_2, …, w_n}.
For any word pair (w_i, w_j) in W, if w_i ∈ V, w_j ∈ V, and the word pair is contained in E, i.e. there is an e(w_i, w_j), then the association degree l_ij of these two words is the co-occurrence rate of e(w_i, w_j); if w_i or w_j is not contained in V, the association degree of the two words is computed from the target document itself: l_ij = P_d(w_i, w_j). In this way, the association degree set L = {l_11, l_12, …, l_1n, l_21, … l_2n, …, l_(n-1)n} of all word pairs in the feature word set W of the target document d is generated.
Word pairs with a higher association degree share closer latent semantics. According to the association degrees between the word pairs, several feature word chains can be constructed by a clustering method, each showing one related topic of the target document. The feature word chains are gathered into a feature word chain set C in the order in which they are generated during clustering. The specific method is as follows:
Input: clustering threshold γ
word-pair association degree set L to be clustered
word set W to be clustered
Output: feature word chain set C
The method comprises the following steps:
(1) select an unclassified word w_i from the feature word set W;
(2) look up in L the association degrees between w_i and the existing words of the existing word chains in C; if the association degree with some word of a word chain c_j is greater than the threshold γ, w_i is added to the word chain c_j; if the association degrees between w_i and all existing words of all existing word chains in C are less than γ, w_i becomes the first word of a new word chain;
(3) repeat steps (1) and (2) until all words have been added to C;
(4) the algorithm ends; return the feature word chain set C.
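Steps (1)-(4) can be sketched as a greedy pass over the word set; the association degrees are given here as a dictionary keyed by word pairs, and all names are illustrative.

```python
def build_word_chains(words, assoc, gamma):
    """Greedy word-chain clustering: a word joins the first existing chain
    that contains some word whose association degree with it exceeds gamma;
    otherwise it starts a new chain."""
    def degree(a, b):
        # Association degrees are symmetric; the dict may store either order.
        return assoc.get((a, b), assoc.get((b, a), 0.0))

    chains = []
    for w in words:                # (1) take each unclassified word in turn
        for chain in chains:       # (2) compare against existing chains
            if any(degree(w, v) > gamma for v in chain):
                chain.append(w)
                break
        else:                      # association below gamma everywhere:
            chains.append([w])     # w becomes the first word of a new chain
    return chains                  # (4) the feature word chain set C
```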
In one embodiment, the similarity between the feature words contained in a single sentence and a feature word chain is calculated as follows:
The target document d comprises a single-sentence set S and a feature word set W = {w_1, w_2, …, w_n}. According to whether a single sentence s_q in S contains each feature word of W, the relation of s_q to the feature words in W is obtained as the indicator vector
x_q = (x_q1, x_q2, …, x_qn)
wherein x_qi = 1 if s_q contains w_i, and x_qi = 0 otherwise.
The target document d comprises a feature word chain set C and the feature word set W = {w_1, w_2, …, w_n}. According to whether a word chain c_p in C contains each feature word of W, the relation of c_p to the feature words in W is obtained as the indicator vector
y_p = (y_p1, y_p2, …, y_pn)
wherein y_pi = 1 if c_p contains w_i, and y_pi = 0 otherwise.
The similarity of a single sentence s_q and a word chain c_p is then the cosine of the two indicator vectors:
Sim(s_q, c_p) = Σ_(i=1..n) x_qi·y_pi / (sqrt(Σ_(i=1..n) x_qi^2) × sqrt(Σ_(i=1..n) y_pi^2))
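The indicator vectors and the cosine similarity can be sketched as follows (all names illustrative):

```python
import math

def indicator(contained_words, feature_words):
    """x_q / y_p: component i is 1 iff the sentence (or chain) contains w_i."""
    present = set(contained_words)
    return [1 if w in present else 0 for w in feature_words]

def chain_similarity(sentence_words, chain_words, feature_words):
    """Cosine similarity of the two 0/1 indicator vectors over W."""
    x = indicator(sentence_words, feature_words)
    y = indicator(chain_words, feature_words)
    dot = sum(a * b for a, b in zip(x, y))
    nx, ny = sum(x), sum(y)   # squared norms of 0/1 vectors equal their sums
    return dot / math.sqrt(nx * ny) if nx and ny else 0.0
```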
In one embodiment, the abstract sentences of the topic evaluation summary are selected as follows:
The similarities between single sentences and word chains in the target document d are written as the set U = {u_11, u_12, …, u_fg}, where u_ij = Sim(s_i, c_j), f is the total number of single sentences in the target document d, and g is the total number of feature word chains.
The number of sentences allowed into the abstract is determined by:
N = f × R
where N denotes the limit on the number of abstract sentences and R denotes the compression ratio of the abstract, 0 < R < 1, with the value of R set by the user as desired. If the number N of abstract sentences to be extracted is less than the total number g of feature word chains, the single sentences with the highest similarity to the first N word chains in the word chain set C are selected from S as the abstract sentences. If N is greater than g, the single sentence with the highest similarity to each word chain in C is extracted from S in turn as an abstract sentence, and extraction then continues from the word chains with the highest single-sentence similarities until the required number of abstract sentences is reached.
The extraction procedure is as follows:

Input: feature word chain set C
single sentence set S
similarity set U
number N of summary sentences to be extracted
similarity threshold δ

Output: summary sentence set Y

The steps are:

(1) j = 1, r = 0
(2) Do
(3) select the word chain c_j of the feature word chain set C;
(4) select u_ij from U such that u_ij = max(u_1j, u_2j, …, u_fj), u_ij > δ, and s_i has not already been added to Y;
(5) if such a u_ij exists, add s_i to the summary sentence set Y and set r = 0; otherwise record an idle selection, r = r + 1;
(6) if j is less than the number g of chains in the set C, then j = j + 1, otherwise j = 1;
(7) Until the number of sentences in Y equals N, or r = g, i.e. no word chain satisfies the selection condition;
(8) when the algorithm ends, return the summary sentence set Y.
And finally, the summary sentences in Y are ordered according to their original sequence in the target document d to form a topic evaluation summary meeting the user's preference.
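The round-robin extraction procedure can be sketched as follows. Note one assumption: in the listed steps the idle counter r is reset whenever j advances, which would prevent the r = g termination test from ever firing; this sketch instead resets r only on a successful selection, which appears to be the intent (a full fruitless round over all g chains ends the loop):

```python
def select_summary(chains, sentences, U, N, delta):
    """Round-robin summary sentence selection: visit the feature word
    chains in turn and take the not-yet-chosen sentence with the highest
    similarity to the current chain, provided that similarity exceeds
    delta.  U[i][j] is the similarity between sentence i and chain j.
    r counts consecutive fruitless chains; r == g ends the loop early."""
    f, g = len(sentences), len(chains)
    Y, chosen = [], set()
    r, j = 0, 0
    while len(Y) < N and r < g:
        best_i, best_u = -1, delta
        for i in range(f):
            if i not in chosen and U[i][j] > best_u:
                best_i, best_u = i, U[i][j]
        if best_i >= 0:
            Y.append(sentences[best_i])
            chosen.add(best_i)
            r = 0                       # successful selection resets the idle count
        else:
            r += 1                      # no sentence qualified for this chain
        j = (j + 1) % g                 # move on to the next word chain
    return Y
```

For example, with three sentences, two chains, and U = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.4]], the first round picks the best sentence for each chain, and a second pass over chain 1 picks the remaining qualifying sentence if N allows.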
The method is applicable not only to catering evaluation but also to fields such as hotels, tourism and online shopping, generating query-oriented automatic summaries from evaluation texts. In this embodiment, the similarity between a single sentence and a feature word chain may be calculated not only by the cosine similarity method but also by other similarity measures such as the Jaccard, Dice and Overlap formulas.
Example 2:
as shown in fig. 2, the present embodiment provides a method for automatically generating a topic evaluation summary based on user preferences, which includes the following steps:
step 101, collecting evaluation texts previously published on the internet by a customer as sample documents, as the basis for mining the customer's catering preferences;

step 102, preprocessing the sample documents, including but not limited to word segmentation and stop word removal;

step 103, calculating word pair co-occurrence rates from the preprocessed sample documents;

step 104, storing the word pair co-occurrence rates in a feature database;

step 105, when past dining evaluation texts of the customer are lacking, or the customer has other preference requirements, several feature keywords can be entered manually by the customer and stored in the feature database as pairwise associated co-occurrence word pairs;

step 106, collecting other customers' evaluation texts for restaurants within the customer's selection range (e.g., nearby geographic location, suitable price range), and aggregating them into one target document per restaurant;

step 107, preprocessing the target documents, including but not limited to word segmentation and stop word removal;

step 108, extracting the feature word set of each target document, looking up the association degree of each word in the feature word set in the feature database, and generating feature word chains;

step 109, dividing the target document into single sentences;

step 110, calculating the similarity between the feature words contained in each single sentence of the target document and each feature word chain, and, subject to the summary size limit, selecting in turn the single sentence with the highest similarity to each feature word chain to generate the summary.
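The preprocessing of steps 102 and 107 can be sketched minimally. Whitespace tokenization stands in for a real Chinese segmenter (such as jieba), and the stop word list is supplied by the caller; both are illustrative assumptions, not the patent's specified tooling:

```python
def preprocess(text, stopwords):
    """Minimal stand-in for steps 102/107: tokenize the evaluation text
    and drop stop words.  Whitespace splitting substitutes for a real
    word segmenter; real review text would need one (e.g. jieba)."""
    return [tok for tok in text.lower().split() if tok not in stopwords]

# preprocess("The soup was very Good", {"the", "was", "very"})
# returns ["soup", "good"]
```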
Embodiment 2 differs from embodiment 1 in that its specific scenario is a catering setting: the method collects evaluation texts previously published on the internet by a customer as sample documents, uses them as the basis for mining the customer's catering preferences, then analyzes the existing evaluation texts of other users for a restaurant according to the interest preferences exhibited in the customer's past evaluations, and automatically generates an evaluation summary of the restaurant.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and other modifications or equivalent substitutions made by the technical solutions of the present invention by those skilled in the art should be covered in the scope of the claims of the present invention as long as they do not depart from the spirit and scope of the technical solutions of the present invention.

Claims (6)

1. A method for automatically generating a theme evaluation abstract based on user preference is characterized in that: the method comprises the following steps:
step 101, collecting an evaluation text published on the internet by a user based on a specific scene in the past as a sample document;
step 102, preprocessing a sample document;
step 103, calculating word pair co-occurrence rate from the preprocessed sample document;
step 104, storing the word pair co-occurrence rate into a feature database;
step 105, when the past evaluation texts of the user are lacked or the user has other preference requirements, manually inputting a plurality of feature keywords by the user, and storing the keywords into a feature database as pairwise associated co-occurrence word pairs;
step 106, collecting evaluation texts of other corresponding users in the specific scene in accordance with the user selection range, and respectively summarizing to generate target documents;
step 107, preprocessing the target document;
step 108, extracting a feature word set of the target document, looking up the association degree of each word in the feature word set in the feature database, and generating feature word chains;
step 109, dividing a single sentence from the target document;
step 110, calculating the similarity between the feature words contained in each single sentence in the target document and each feature word chain, and selecting in turn, according to the similarity relation between the single sentences and each feature word chain, the single sentence with the highest similarity to each feature word chain to generate the summary.
2. The method for automatically generating a topic evaluation summary based on user preferences of claim 1, wherein: in step 103, the word pair co-occurrence rate in a single sample document d is specifically calculated as follows:

the word pair co-occurrence rate P_d(w_i, w_j) of any two words w_i and w_j in the sample document d is calculated by the following formula:
P_d(w_i, w_j) = Σ_{t∈T} N_t(w_i, w_j) / ( S_d(w_i) + S_d(w_j) − S_d(w_i, w_j) )

wherein T is the set of window units in the sample document d; W = {w_1, w_2, …, w_n} is the word set of the sample document d, w_i and w_j are any two words in the word set W, and i, j and n are positive integers; S_d(w_i) denotes the number of window units in the sample document d that contain w_i; S_d(w_j) denotes the number of window units in the sample document d that contain w_j; S_d(w_i, w_j) denotes the number of window units in the sample document d that contain both w_i and w_j; N_t(w_i, w_j) denotes the number of co-occurrences of w_i and w_j in a certain window unit t, t ∈ T, where, when the numbers of occurrences of w_i and w_j in the same window unit differ, the minimum of the two occurrence counts is taken as the co-occurrence count.
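A window-based co-occurrence rate along these lines can be sketched as follows. The formula image is lost in the source, so the normalization used here (summed per-window co-occurrence counts over the union of windows containing either word) is one plausible reading of the defined terms, not the patent's confirmed formula; the per-window count N_t does follow the stated minimum-of-occurrences rule:

```python
from collections import Counter

def cooccurrence_rate(windows, wi, wj):
    """Word pair co-occurrence rate over a document's window units.
    Each window is a list of tokens.  Per the patent text, the
    co-occurrence count N_t within one window is the minimum of the
    two words' occurrence counts; the normalization is an assumption."""
    s_i = s_j = s_ij = n_total = 0
    for window in windows:
        counts = Counter(window)
        has_i, has_j = counts[wi] > 0, counts[wj] > 0
        s_i += has_i                 # windows containing w_i
        s_j += has_j                 # windows containing w_j
        if has_i and has_j:
            s_ij += 1                # windows containing both
            n_total += min(counts[wi], counts[wj])   # N_t(w_i, w_j)
    union = s_i + s_j - s_ij         # windows containing either word
    return n_total / union if union else 0.0
```

For example, over the windows [["a","b","a"], ["a","c"], ["b","b"]], the pair ("a","b") co-occurs once in one of three relevant windows, giving a rate of 1/3.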
3. The method of claim 2, wherein: in step 103, the total word pair co-occurrence rate over k sample documents {d_1, d_2, …, d_k} is specifically calculated as follows:

P_k(w_i, w_j) = ( P_{k−1}(w_i, w_j) · S_{k−1}(w_i, w_j) + P_{d_k}(w_i, w_j) · S_{d_k}(w_i, w_j) ) / ( S_{k−1}(w_i, w_j) + S_{d_k}(w_i, w_j) )

wherein k is a positive integer greater than 1; P_{k−1}(w_i, w_j) is the total word pair co-occurrence rate of the words w_i and w_j in the first k−1 sample documents {d_1, d_2, …, d_{k−1}}; S_{k−1}(w_i, w_j) is the total number of window units in the first k−1 sample documents {d_1, d_2, …, d_{k−1}} in which the word pair w_i and w_j co-occurs; P_{d_k}(w_i, w_j) is the word pair co-occurrence rate of w_i and w_j in the kth sample document d_k; S_{d_k}(w_i, w_j) denotes the number of window units in the kth sample document d_k that contain both w_i and w_j; D is the sample document space {d_1, d_2, …, d_k}.
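The incremental update this claim describes — folding a new document's co-occurrence rate into the running total — can be sketched as a window-count-weighted average. The exact weighting is an assumption reconstructed from the terms the claim defines, since the formula images are lost:

```python
def update_total_rate(p_prev, s_prev, p_k, s_k):
    """Fold sample document d_k into the running totals.
    p_prev, s_prev: total co-occurrence rate and total number of
        co-occurring window units over the first k-1 documents.
    p_k, s_k: the same two quantities for document d_k alone.
    Returns the updated (rate, window count) pair."""
    total_windows = s_prev + s_k
    if total_windows == 0:
        return 0.0, 0
    p_total = (p_prev * s_prev + p_k * s_k) / total_windows
    return p_total, total_windows
```

Because only the previous rate and window count are kept, the feature database never has to re-scan earlier sample documents when a new one arrives.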
4. A method for automatically generating a topic evaluation summary based on user preferences according to claim 1, 2 or 3, wherein: the feature database is composed of feature words and co-occurrence word pairs.
5. The method of claim 4 for automatically generating a topic evaluation summary based on user preferences, wherein: looking up the association degree of each word of the feature word set in the feature database and generating feature word chains comprises: for a target document d from which the summary is to be generated, looking up the association degree of each word of the feature word set in the feature database and in the target document d, and generating feature word chains, with the following specific steps:

the feature database stores a feature word set V = {v_1, v_2, …, v_m} and a co-occurrence word pair set E = {e(v_i, v_j)}; any co-occurrence word pair e(v_i, v_j) in E is a Word Co-occurrence structure comprising the numbers, the co-occurrence rate and the co-occurrence count of the two words;

the target document d contains a feature word set W = {w_1, w_2, …, w_n}; for any word pair (w_i, w_j) in W, if w_i ∈ V and w_j ∈ V and the word pair is contained in E, i.e. e(w_i, w_j) exists, then the association degree l_ij of the two words is the co-occurrence rate recorded in e(w_i, w_j); if w_i ∉ V or w_j ∉ V, the association degree of the two words is l_ij = P_d(w_i, w_j); in this way, the association degree set L = {l_11, l_12, …, l_1n, l_21, …, l_2n, …, l_(n−1)n} of all word pairs in the feature word set W of the target document d is generated;

according to the association degrees between word pairs, a plurality of feature word chains are constructed by a clustering method, thereby exhibiting a plurality of related topics of the target document d; the feature word chains are collected, in the order in which they are generated during clustering, into the feature word chain set C.
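The claim leaves the clustering method open. One simple choice is greedy single-link clustering over the association degrees; the threshold and the sample words below are invented for illustration:

```python
def build_word_chains(words, assoc, threshold):
    """Greedy single-link clustering sketch: a word joins the first
    existing chain containing a member whose association degree with it
    exceeds the threshold, otherwise it starts a new chain.
    assoc maps (a, b) pairs to the association degree l_ij (symmetric)."""
    def degree(a, b):
        return assoc.get((a, b), assoc.get((b, a), 0.0))

    chains = []
    for w in words:
        for chain in chains:
            if any(degree(w, m) > threshold for m in chain):
                chain.append(w)
                break
        else:
            chains.append([w])       # no chain accepted w: start a new one
    return chains
```

With assoc = {("taste", "flavor"): 0.8, ("taste", "price"): 0.1} and a threshold of 0.5, the words ["taste", "flavor", "price"] cluster into the two chains [["taste", "flavor"], ["price"]], each chain representing one topic of the target document.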
6. The method for automatically generating a topic evaluation summary based on user preferences of claim 5, wherein: the similarity between the feature words contained in a single sentence and a feature word chain is calculated through the following specific steps:

the target document d comprises a single sentence set S and a feature word set W = {w_1, w_2, …, w_n}; according to whether a single sentence s_q in the single sentence set S contains each feature word in W, the relation of s_q to the feature words in W is obtained as the binary vector

s_q = (x_q1, x_q2, …, x_qn)

wherein:

x_qi = 1 if the feature word w_i occurs in s_q, and x_qi = 0 otherwise;

the target document d comprises a feature word chain set C and the feature word set W = {w_1, w_2, …, w_n}; according to whether a word chain c_p in the word chain set C contains each feature word in W, the relation of c_p to the feature words in W is obtained as the binary vector

c_p = (y_p1, y_p2, …, y_pn)

wherein:

y_pi = 1 if the feature word w_i occurs in c_p, and y_pi = 0 otherwise;

the similarity of the single sentence s_q and the word chain c_p is calculated by the cosine formula:

Sim(s_q, c_p) = Σ_i (x_qi · y_pi) / ( √(Σ_i x_qi²) · √(Σ_i y_pi²) )
CN202010022473.7A 2020-01-09 2020-01-09 Method for automatically generating theme evaluation abstract based on user preference Active CN111259136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010022473.7A CN111259136B (en) 2020-01-09 2020-01-09 Method for automatically generating theme evaluation abstract based on user preference


Publications (2)

Publication Number Publication Date
CN111259136A true CN111259136A (en) 2020-06-09
CN111259136B CN111259136B (en) 2024-03-22

Family

ID=70954087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010022473.7A Active CN111259136B (en) 2020-01-09 2020-01-09 Method for automatically generating theme evaluation abstract based on user preference

Country Status (1)

Country Link
CN (1) CN111259136B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420142A (en) * 2021-05-08 2021-09-21 广东恒宇信息科技有限公司 Personalized automatic abstract algorithm
CN113918708A (en) * 2021-12-15 2022-01-11 深圳市迪博企业风险管理技术有限公司 Abstract extraction method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006215850A (en) * 2005-02-04 2006-08-17 Nippon Telegr & Teleph Corp <Ntt> Apparatus and method for creating concept information database, program, and recording medium
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Lei; Huang Guangjun: "Research on Semantic Expansion Technology Combined with Concept Semantic Space", Computer Engineering and Applications, no. 35 *
Deng Zhen; Bao Hong: "Research on Multi-Document Automatic Summarization Based on Lexical Chains", Computers and Applied Chemistry, no. 11 *


Also Published As

Publication number Publication date
CN111259136B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
Al-Radaideh et al. A hybrid approach for arabic text summarization using domain knowledge and genetic algorithms
Deshpande et al. Building, maintaining, and using knowledge bases: a report from the trenches
Zhao et al. Topical keyphrase extraction from twitter
US9679001B2 (en) Consensus search device and method
CN109960756B (en) News event information induction method
US20120029908A1 (en) Information processing device, related sentence providing method, and program
CN110083696B (en) Global citation recommendation method and system based on meta-structure technology
US8812504B2 (en) Keyword presentation apparatus and method
Di Fabbrizio et al. Summarizing online reviews using aspect rating distributions and language modeling
Al-Taani et al. An extractive graph-based Arabic text summarization approach
CN110750995A (en) File management method based on user-defined map
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
Dorji et al. Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary
Lin et al. A simple but effective method for Indonesian automatic text summarisation
Kisilevich et al. “Beautiful picture of an ugly place”. Exploring photo collections using opinion and sentiment analysis of user comments
CN111259136B (en) Method for automatically generating theme evaluation abstract based on user preference
Gupta A survey of text summarizers for Indian Languages and comparison of their performance
Yu et al. Role-explicit query identification and intent role annotation
Balasubramanian et al. Topic pages: An alternative to the ten blue links
Haubold et al. Web-based information content and its application to concept-based video retrieval
KR102275095B1 (en) The informatization method for youtube video metadata for personal media production
Charnine et al. Association-Based Identification of Internet Users Interest
Almasian et al. Qfinder: A framework for quantity-centric ranking
Li et al. CLC-RS: a Chinese legal case retrieval system with masked language ranking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant