CN111259136A - Method for automatically generating theme evaluation abstract based on user preference - Google Patents

Method for automatically generating theme evaluation abstract based on user preference

Info

Publication number
CN111259136A
Authority
CN
China
Prior art keywords
word
feature
characteristic
document
occurrence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010022473.7A
Other languages
Chinese (zh)
Other versions
CN111259136B (en)
Inventor
何为
刘楠
马文鹏
李银
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xinyang Normal University
Original Assignee
Xinyang Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xinyang Normal University filed Critical Xinyang Normal University
Priority to CN202010022473.7A priority Critical patent/CN111259136B/en
Publication of CN111259136A publication Critical patent/CN111259136A/en
Application granted granted Critical
Publication of CN111259136B publication Critical patent/CN111259136B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Abstract

The invention discloses a method for automatically generating a topic evaluation abstract based on user preference. Past evaluation information of a customer is collected, and sample evaluation documents from a field of interest provided by the customer are analyzed; using the semantic relations implied by the co-occurrence of word pairs in the sample documents, a feature database of co-occurrence word pairs is established by calculating word-pair co-occurrence rates. Using this feature database, feature word chains are clustered and similarities are calculated for the target text, so as to provide a query-based automatic abstract matching the customer's interests. The method is not limited to the field of catering evaluation and can also be used for recommendations in other activities such as online shopping and travel, and for information retrieval in public and professional fields.

Description

Method for automatically generating theme evaluation abstract based on user preference
Technical Field
The invention relates to automatic summarization technology in the natural language processing direction of the computer research field, and in particular to a method for automatically generating a topic evaluation abstract based on user preference.
Background
With the continuous development of the internet and information technology, research on automatic summarization has entered an unprecedented stage of prosperity. By intended user, automatic summarization can be divided into generic summarization (Generic Summarization) and query-based summarization (Query-based Summarization). Query-based automatic summarization, which provides an abstract with a corresponding emphasis according to the needs or interests of the user, is also called user-focused summarization (User-focused Summarization), topic-focused summarization (Topic-focused Summarization) or query-focused summarization (Query-focused Summarization); compared with a generic abstract, which covers the overall main content of the full text, a query-based automatic abstract better reflects the individual requirements of the user.
Research on query-based automatic summarization began abroad in the 1980s. It is an important component of automatic summarization, and since query-based and generic automatic summarization face similar limits on the size of the result, most methods extract semantically similar sentences to form an abstract of the related topic. An article by V. Plachouras and I. Ounis (published in Information Retrieval, 2005, Vol. 8, No. 2) studies methods for improving query accuracy in web pages using automatic summarization. The article "A task-oriented study on the influencing effects of query-biased summarisation in web searching" (by R. White, J. Jose and I. Ruthven, published in Information Processing & Management, 2003, Vol. 39, No. 5) proposes generating query-based automatic abstracts according to the frequency of the query words in each sentence of a web page and the text style. An article on the application of the keyword density distribution method to summarization (authors Yan, Lin Hongfei, Yang and Zhao, published in Computer Engineering, 2007, Vol. 33, No. 6) adopts a keyword density algorithm to generate query-based automatic abstracts.
The key step of query-based automatic summarization is obtaining the biased topics of the query, and the common strategy for this is to expand the conceptual semantics of the user's query. Some scholars use the frequency and position of the user's query keywords in the article to obtain weights, but this method can only obtain results mechanically and cannot meet semantic requirements: the true interest behind a user's query is difficult to define accurately from the query terms alone. The ideal approach is to expand the user's query words with general-purpose semantic resources, but no suitable general-purpose semantic resource currently exists, so some scholars obtain the bias from existing semantic libraries, using, for example, WordNet for English, or HowNet and the synonym forest for Chinese, as the semantic resource library. The article "A Sentence Selection Method for Query-Based Chinese Multi-Document Summarization" (by X. Song, J. Huang, J. Zhou and H. Zhang, published in Proceedings of PACIIA 2009) calculates the similarity between feature words using HowNet and uses it to guide the selection of abstract sentences for query-based automatic summarization. Such methods, however, are easily limited by the scale and update speed of the semantic library, and can hardly cope with the flood of newly coined words on the internet.
With the wide application of the mobile internet, checking other people's evaluations of a restaurant online when choosing a dining place, and publishing one's own evaluation after the meal, has become a popular lifestyle among young people. Existing catering evaluation websites such as Dianping and Meituan have collected a large amount of evaluation information from diners, which has become a basis for others when choosing where to eat. However, when choosing a restaurant, a user has to check and read many comments one by one to find out whether they match his or her own needs. Some websites ask users to give a score when submitting an evaluation and use the collected scores as a recommendation index, but users differ in their individual needs, consumption habits and tastes, and such a simple score cannot serve as a basis for a user to select a restaurant suited to him or her.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method for automatically generating a topic evaluation abstract based on user preference, for practical application scenarios in which a user selects restaurants, hotels, travel destinations and the like according to evaluations.
The technical scheme adopted for realizing the aim of the invention is as follows:
a method for automatically generating a theme evaluation abstract based on user preference comprises the following steps:
step 101, collecting an evaluation text published on the internet by a user based on a specific scene in the past as a sample document;
step 102, preprocessing a sample document;
step 103, calculating word pair co-occurrence rate from the preprocessed sample document;
step 104, storing the word pair co-occurrence rate into a feature database;
step 105, when past evaluation texts of the user are lacking or the user has other preference requirements, letting the user manually input several feature keywords, and storing these keywords in the feature database as pairwise associated co-occurrence word pairs;
step 106, collecting the evaluation texts of other corresponding users in the specific scene, within the user's selection range, and summarizing them respectively to generate target documents;
step 107, preprocessing the target document;
step 108, extracting a feature word set of the target document, looking up the association degree of each word in the feature word set from the feature database, and generating feature word chains;
step 109, dividing a single sentence from the target document;
step 110, calculating the similarity between the feature words contained in each single sentence of the target document and the feature word chains, and, according to the similarity between each single sentence and each feature word chain, selecting in turn the single sentence with the highest similarity to each feature word chain to generate the abstract.
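Taken together, steps 101-110 can be sketched as a minimal pipeline. All function and variable names below are illustrative (not from the patent), and preprocessing is reduced to whitespace tokenization with a toy stop-word list, whereas a real system would use a Chinese word segmenter.

```python
# Minimal sketch of steps 101-110; names and stop words are illustrative.
STOP_WORDS = {"the", "a", "is", "and", "was"}

def preprocess(document: str) -> list[list[str]]:
    """Split a document into paragraphs (window units) of content words."""
    paragraphs = [p for p in document.split("\n\n") if p.strip()]
    return [[w for w in p.split() if w.lower() not in STOP_WORDS]
            for p in paragraphs]

def generate_summary(sample_docs, target_doc, build_db, build_chains, pick_sentences):
    """Steps 102-110 wired together; the three callables stand in for the
    co-occurrence feature database, word-chain clustering and similarity-based
    sentence selection described below."""
    db = build_db([preprocess(d) for d in sample_docs])       # steps 102-104
    chains = build_chains(db, preprocess(target_doc))         # steps 107-108
    sentences = [s.strip()                                    # step 109
                 for s in target_doc.replace("\n", " ").split(".") if s.strip()]
    return pick_sentences(sentences, chains)                  # step 110
```

The three callables stand in for the components that the following paragraphs describe in detail.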
Preferably, in step 103, the word-pair co-occurrence rate in a single sample document d is calculated as follows:
For any two words w_i and w_j in the sample document d, the word-pair co-occurrence rate P_d(w_i, w_j) is calculated by the following formula:
[Formula image in the original: the word-pair co-occurrence rate P_d(w_i, w_j), computed from S_d(w_i), S_d(w_j), S_d(w_i, w_j) and N_t(w_i, w_j) as defined below.]
wherein T is the set of window units in the sample document d; W = {w_1, w_2, …, w_n} is the word set of the sample document d; w_i and w_j are any two words in the word set W, and i, j and n are positive integers; S_d(w_i) denotes the number of window units in the sample document d that contain w_i; S_d(w_j) denotes the number of window units that contain w_j; S_d(w_i, w_j) denotes the number of window units in the sample document d that contain both w_i and w_j; N_t(w_i, w_j) denotes the number of co-occurrences of w_i and w_j in a given window unit t, t ∈ T; when the numbers of occurrences of w_i and w_j in the same paragraph differ, the smaller of the two occurrence counts is taken as the co-occurrence count.
Preferably, in step 103, the total word-pair co-occurrence rate over k sample documents {d_1, d_2, …, d_k} is calculated as follows:
P_k(w_i, w_j) = [P_(k-1)(w_i, w_j) × S_(k-1)(w_i, w_j) + P_dk(w_i, w_j) × S_dk(w_i, w_j)] / [S_(k-1)(w_i, w_j) + S_dk(w_i, w_j)]
wherein k is a positive integer greater than 1; P_(k-1)(w_i, w_j) is the total word-pair co-occurrence rate of w_i and w_j in the first k-1 sample documents {d_1, d_2, …, d_(k-1)}; S_(k-1)(w_i, w_j) is the total number of window units in the first k-1 sample documents {d_1, d_2, …, d_(k-1)} in which the word pair (w_i, w_j) co-occurs; P_dk(w_i, w_j) is the word-pair co-occurrence rate of w_i and w_j in the kth sample document d_k; S_dk(w_i, w_j) denotes the number of window units in the kth sample document d_k that contain both w_i and w_j; D is the sample document space {d_1, d_2, …, d_k}.
Preferably, the feature database is composed of feature words and co-occurrence word pairs.
Preferably, looking up the association degree of each word in the feature word set from the feature database and generating feature word chains comprises the following steps for a target document d whose abstract is to be generated:
The feature database stores a feature word set V = {v_1, v_2, …, v_m} and a co-occurrence word pair set E = {e(v_i, v_j)}; any co-occurrence word pair e(v_i, v_j) in E is a Word Co-occurrence structure comprising the numbers, the co-occurrence rate and the co-occurrence count of the two words;
The target document d contains a feature word set W = {w_1, w_2, …, w_n}. For any word pair (w_i, w_j) in W, if w_i ∈ V, w_j ∈ V, and the word pair is contained in E, i.e. there is an e(w_i, w_j), then the association degree l_ij of these two words is the co-occurrence rate of e(w_i, w_j); if w_i or w_j is not contained in V, the association degree of the two words is computed from the target document itself: l_ij = P_d(w_i, w_j). In this way, the association degree set L = {l_11, l_12, …, l_1n, l_21, … l_2n, …, l_(n-1)n} of all word pairs in the feature word set W of the target document d is generated;
According to the association degrees between the word pairs, a number of feature word chains are constructed by a clustering method, each chain showing one related topic of the target document d; the feature word chains are gathered into a feature word chain set C in the order in which they are generated during clustering.
Preferably, the similarity between the feature words contained in a single sentence and a feature word chain is calculated as follows:
The target document d comprises a single-sentence set S and a feature word set W = {w_1, w_2, …, w_n}. According to whether a single sentence s_q in S contains each feature word of W, the relation of s_q to the feature words in W is obtained as the indicator vector
x_q = (x_q1, x_q2, …, x_qn)
wherein x_qi = 1 if s_q contains w_i, and x_qi = 0 otherwise.
The target document d comprises a feature word chain set C and the feature word set W = {w_1, w_2, …, w_n}. According to whether a word chain c_p in C contains each feature word of W, the relation of c_p to the feature words in W is obtained as the indicator vector
y_p = (y_p1, y_p2, …, y_pn)
wherein y_pi = 1 if c_p contains w_i, and y_pi = 0 otherwise.
The similarity of a single sentence s_q and a word chain c_p is then the cosine of the two indicator vectors:
Sim(s_q, c_p) = Σ_(i=1..n) x_qi·y_pi / (sqrt(Σ_(i=1..n) x_qi^2) × sqrt(Σ_(i=1..n) y_pi^2))
the invention has the beneficial effects that:
the method overcomes the defect that the theme which is interested by the user is difficult to be described when the query type automatic abstract is obtained by expanding the query words provided by the user in the prior art. By extracting word pairs in the subject text, the implicit connection between the characteristic words of the subject text can be found, and the content really interested by the user can be extracted from the target text to generate the query type automatic abstract. By means of the overall clustering of the evaluation texts, the method can avoid the deviation of subjective opinions of individual users, and enables the conclusion to be more objective and fair.
The invention has the effect that the semantic relation implied by the co-occurrence relation of the word pairs in the sample document is utilized, the co-occurrence frequency and the co-occurrence position of the word pairs contain the implicit relation between the word pairs and the theme, and the more the number of the occurrence times of a certain word pair in the window unit is, the closer relation of the word pair in the theme is shown. By calculating the co-occurrence rate of the word pairs in the single text and the plurality of texts, the generated co-occurrence word pair has the characteristics of expandability and updatability for the theme feature library, and the interesting content of the user can be more obviously embodied along with the increase of the number of the theme texts.
The invention discloses a method for automatically generating an evaluation abstract of a restaurant by analyzing the existing evaluation text information of other users of the restaurant according to the interest preference of the past evaluation display of the user. The method is characterized in that an incidence relation in user evaluation is mined in a co-occurrence word pair mode and is used as a basis for selecting abstract sentences. The method is not limited to the field of catering evaluation, and is also suitable for recommending other activities such as online shopping consumption, travel, accommodation and the like.
Drawings
FIG. 1 is a flowchart of the method of embodiment 1 of the present invention.
FIG. 2 is a flowchart of a method of embodiment 2 of the present invention.
Detailed Description
The invention is further described in detail below with reference to the drawings and the detailed description.
Co-occurrence analysis of words is one of the successful applications of natural language processing technology in information retrieval; its core idea is that the co-occurrence frequency between words reflects, to some extent, the semantic association between them. Co-occurrence research rests on the following assumption: in a large-scale text corpus, two words are considered semantically related if they frequently appear together in the same window unit (e.g., a document, a natural paragraph, a sentence), and the more frequently they co-occur, the closer the semantics of the two words.
When the co-occurrence rate of a feature word pair exceeds a certain threshold, the pair is related to at least one topic. Since a document may consist of one or more topics, topics can be divided by window units. Following the way people usually organize language, most articles have a clear theme and a compact structure, and a topic is usually made up of one or more natural paragraphs, so window units are divided along natural paragraphs.
Two words appearing within a certain distance of each other in an article form a co-occurrence word pair. The co-occurrence frequency and position of a word pair carry an implicit relation between the pair and the topic: the more often a word pair appears within a window unit, the more closely the pair is related to the topic; such a pair may be two semantically related words, a fixed collocation, and so on. Two words that appear only in different window units, by contrast, do not necessarily contribute much to a common topic, since they may belong to different topics.
The method uses the semantic relations implied by the co-occurrence of word pairs in the sample documents and, by calculating word-pair co-occurrence rates, establishes a topic feature library of co-occurrence word pairs, realizing an extensible, updatable semantic resource library oriented to the user's field of interest. Using this topic feature library, a query-based automatic abstract extraction method is designed through the clustering of feature word chains and the calculation of similarity. The method restricts window units to paragraphs and achieves a good abstracting effect on articles with a traditional chapter structure.
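Since window units are divided along natural paragraphs, the division can be sketched as follows. This assumes paragraphs are separated by blank lines, which is an illustrative assumption; the patent does not fix a text format.

```python
import re

def window_units(text: str) -> list[str]:
    """Split a document into window units, one per natural paragraph.
    Paragraphs are assumed to be separated by one or more blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```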
Embodiment 1:
As shown in FIG. 1, this embodiment provides a method for automatically generating a topic evaluation abstract based on user preference, comprising the following steps:
step S1, collecting the evaluation texts published on the network by the user in the specific scene as sample documents, as the basis for mining the user's preferences; the specific scenes comprise restaurants, lodging, tourism, online shopping and the like;
step S2, preprocessing the sample document, including but not limited to word segmentation and stop word removal;
step S3, calculating word pair co-occurrence rate from the preprocessed sample document;
step S4, storing the word pair co-occurrence rate into a feature database;
step S5, when the user lacks the past evaluation text or the user has other preference requirements, the user can manually input a plurality of feature keywords, and the keywords are used as co-occurrence word pairs which are associated with each other and are stored in a feature database;
step S6, collecting other customer evaluation texts corresponding to specific scenes in a user selection range (such as a closer geographical position, a proper price range and the like), and summarizing to generate a target document; for example, the evaluation texts of other customers such as restaurants and hotels are collected and generated into target documents by taking the restaurants and hotels as units;
step S7, preprocessing the target document, including but not limited to word segmentation, stop word removal and other methods;
step S8, extracting a feature word set of the target document, looking up the association degree of each word in the feature word set from the feature database, and generating feature word chains;
step S9, dividing a single sentence from the target document;
step S10, calculating the similarity between the feature words contained in each single sentence of the target document and the feature word chains, and selecting in turn the single sentence with the highest similarity to each feature word chain, subject to the abstract size limit, to generate the abstract.
In one embodiment, in step S3, the word-pair co-occurrence rate in a single sample document d is calculated as follows:
For any two words w_i and w_j in the sample document d, the word-pair co-occurrence rate P_d(w_i, w_j) is calculated by the following formula:
[Formula image in the original: the word-pair co-occurrence rate P_d(w_i, w_j), computed from S_d(w_i), S_d(w_j), S_d(w_i, w_j) and N_t(w_i, w_j) as defined below.]
wherein T is the set of window units in the sample document d; W = {w_1, w_2, …, w_n} is the word set of the sample document d; w_i and w_j are any two words in the word set W, and i, j and n are positive integers; S_d(w_i) denotes the number of window units in the sample document d that contain w_i; S_d(w_j) denotes the number of window units that contain w_j; S_d(w_i, w_j) denotes the number of window units in the sample document d that contain both w_i and w_j; N_t(w_i, w_j) denotes the number of co-occurrences of w_i and w_j in a given window unit t, t ∈ T; when the numbers of occurrences of w_i and w_j in the same paragraph differ, the smaller of the two occurrence counts is taken as the co-occurrence count.
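The window-unit counts that the co-occurrence rate is defined over can be sketched as follows; the formula itself appears only as an image in the original, so only the counts S_d and N_t are implemented, and all names are illustrative. Note the rule from the text: when w_i and w_j occur a different number of times in the same paragraph, the smaller count is taken as the co-occurrence count.

```python
def S_d(doc, *words):
    """Number of window units of a document that contain every given word.
    A document is a list of window units (natural paragraphs), each a
    list of words."""
    return sum(1 for unit in doc if all(w in unit for w in words))

def N_t(unit, wi, wj):
    """Co-occurrence count of wi and wj inside one window unit t: the
    minimum of the two occurrence counts, per the patent text."""
    return min(unit.count(wi), unit.count(wj))

# A toy document with three window units.
doc = [["food", "good", "food"],
       ["service", "slow"],
       ["food", "service", "food", "service"]]
```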
In one embodiment, in step S3, the total word-pair co-occurrence rate over k sample documents {d_1, d_2, …, d_k} is calculated as follows:
P_k(w_i, w_j) = [P_(k-1)(w_i, w_j) × S_(k-1)(w_i, w_j) + P_dk(w_i, w_j) × S_dk(w_i, w_j)] / [S_(k-1)(w_i, w_j) + S_dk(w_i, w_j)]
wherein k is a positive integer greater than 1; P_(k-1)(w_i, w_j) is the total word-pair co-occurrence rate of w_i and w_j in the first k-1 sample documents {d_1, d_2, …, d_(k-1)}; S_(k-1)(w_i, w_j) is the total number of window units in the first k-1 sample documents {d_1, d_2, …, d_(k-1)} in which the word pair (w_i, w_j) co-occurs; P_dk(w_i, w_j) is the word-pair co-occurrence rate of w_i and w_j in the kth sample document d_k; S_dk(w_i, w_j) denotes the number of window units in the kth sample document d_k that contain both w_i and w_j; D is the sample document space {d_1, d_2, …, d_k}.
For example, the 1 st sample document d is first calculated1Chinese vocabulary wiAnd wjAnd simultaneously contains wiAnd wjThe number of window units of (1) is marked as P1(wi,wj) And S1(wi,wj) (ii) a Then calculate the vocabulary w in the 2 nd documentiAnd wjWord pair co-occurrence rate of
Figure BDA0002361299250000095
And simultaneously contains wiAnd wjNumber of window units
Figure BDA0002361299250000096
The vocabulary w in the two documentsiAnd wjThe total co-occurrence rate of the word pairs and the total window unit number of co-occurrence are respectively as follows:
Figure BDA0002361299250000101
Figure BDA0002361299250000102
and repeating the steps until the vocabulary w in the k sample documents is calculatediAnd wjWord pair total co-occurrence rate.
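The incremental computation above amounts to a running weighted average, with each document's pair co-occurrence rate weighted by its number of co-occurring window units. A sketch (function name illustrative):

```python
def update_total_cooccurrence(P_prev, S_prev, P_dk, S_dk):
    """Fold sample document k into the running totals:
    P_k = (P_{k-1} * S_{k-1} + P_dk * S_dk) / (S_{k-1} + S_dk)
    S_k = S_{k-1} + S_dk
    starting from P_0 = 0, S_0 = 0."""
    S_k = S_prev + S_dk
    if S_k == 0:
        return 0.0, 0   # the pair has not co-occurred anywhere yet
    return (P_prev * S_prev + P_dk * S_dk) / S_k, S_k
```

Folding in, say, (P_1, S_1) = (0.5, 2) and then (P_d2, S_d2) = (0.25, 4) yields the two-document totals P_2 and S_2 of the example above.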
In one embodiment, the feature database is constructed with the following structure:
The feature database consists of feature words and co-occurrence word pairs, for which a feature word type and a co-occurrence word pair type are designed.
Focus Word is the data structure of a feature word:
[Structure image in the original: a Focus Word record comprises the unique number and the name of a feature word.]
Word Co-occurrence is the data structure of a co-occurrence word pair, consisting of the number IDone of the first word, the number IDtwo of the second word, the co-occurrence rate value and the co-occurrence count CoNum. To reduce storage space, a threshold α can be set during storage: the parameters of a word pair are stored in the feature database only when its co-occurrence rate is higher than α.
The specific operation is as follows. Assume a sample document space contains several sample documents. One sample document is selected at a time, and the co-occurrence rate of every word pair is calculated by the method for the word-pair co-occurrence rate in a single sample document d. When a word pair whose co-occurrence rate is higher than the threshold α is found, the names of its two words are stored as Focus Words, each given a unique number, and the numbers, co-occurrence rate and co-occurrence count of the two words are stored in a Word Co-occurrence.
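The two record types and the threshold test can be sketched as follows. The Python field and class names are illustrative renderings of the Focus Word and Word Co-occurrence structures described above, and α is a free parameter.

```python
from dataclasses import dataclass

@dataclass
class FocusWord:
    id: int       # unique number of the feature word
    name: str     # the word itself

@dataclass
class WordCooccurrence:
    id_one: int   # number of the first word (IDone)
    id_two: int   # number of the second word (IDtwo)
    rate: float   # co-occurrence rate value
    count: int    # co-occurrence count (CoNum)

class FeatureDB:
    """Stores a word pair only when its co-occurrence rate exceeds alpha."""

    def __init__(self, alpha: float):
        self.alpha = alpha
        self.words: dict[str, FocusWord] = {}
        self.pairs: list[WordCooccurrence] = []

    def _word_id(self, name: str) -> int:
        # Assign each stored feature word a unique sequential number.
        if name not in self.words:
            self.words[name] = FocusWord(id=len(self.words), name=name)
        return self.words[name].id

    def add_pair(self, w1: str, w2: str, rate: float, count: int) -> bool:
        if rate <= self.alpha:
            return False   # below threshold: not stored, saving space
        self.pairs.append(
            WordCooccurrence(self._word_id(w1), self._word_id(w2), rate, count))
        return True
```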
In one embodiment, in step S8, extracting the feature word set of the target document, looking up the association degree of each word in the feature word set from the feature database, and generating feature word chains specifically comprises:
For a target document d whose abstract is to be generated, the association degree of each word in the feature word set is looked up from the feature database and from the target document d itself, and feature word chains are generated, as follows:
The feature database stores a feature word set V = {v_1, v_2, …, v_m} and a co-occurrence word pair set E = {e(v_i, v_j)}; any co-occurrence word pair e(v_i, v_j) in E is a Word Co-occurrence structure comprising the numbers, the co-occurrence rate and the co-occurrence count of the two words. The target document d contains a feature word set W = {w_1, w_2, …, w_n}.
For any word pair (w_i, w_j) in W, if w_i ∈ V, w_j ∈ V, and the word pair is contained in E, i.e. there is an e(w_i, w_j), then the association degree l_ij of these two words is the co-occurrence rate of e(w_i, w_j); if w_i or w_j is not contained in V, the association degree of the two words is computed from the target document itself: l_ij = P_d(w_i, w_j). In this way, the association degree set L = {l_11, l_12, …, l_1n, l_21, … l_2n, …, l_(n-1)n} of all word pairs in the feature word set W of the target document d is generated.
Word pairs with a higher association degree share closer latent semantics. According to the association degrees between the word pairs, several feature word chains can be constructed by a clustering method, each showing one related topic of the target document. The feature word chains are gathered into a feature word chain set C in the order in which they are generated during clustering. The specific method is as follows:
Input: clustering threshold γ
word-pair association degree set L to be clustered
word set W to be clustered
Output: feature word chain set C
The method comprises the following steps:
(1) select an unclassified word w_i from the feature word set W;
(2) look up in L the association degrees between w_i and the existing words of the existing word chains in C; if the association degree with some word of a word chain c_j is greater than the threshold γ, w_i is added to the word chain c_j; if the association degrees between w_i and all existing words of all existing word chains in C are less than γ, w_i becomes the first word of a new word chain;
(3) repeat steps (1) and (2) until all words have been added to C;
(4) the algorithm ends; return the feature word chain set C.
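Steps (1)-(4) can be sketched as a greedy pass over the word set; the association degrees are given here as a dictionary keyed by word pairs, and all names are illustrative.

```python
def build_word_chains(words, assoc, gamma):
    """Greedy word-chain clustering: a word joins the first existing chain
    that contains some word whose association degree with it exceeds gamma;
    otherwise it starts a new chain."""
    def degree(a, b):
        # Association degrees are symmetric; the dict may store either order.
        return assoc.get((a, b), assoc.get((b, a), 0.0))

    chains = []
    for w in words:                # (1) take each unclassified word in turn
        for chain in chains:       # (2) compare against existing chains
            if any(degree(w, v) > gamma for v in chain):
                chain.append(w)
                break
        else:                      # association below gamma everywhere:
            chains.append([w])     # w becomes the first word of a new chain
    return chains                  # (4) the feature word chain set C
```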
In one embodiment, the similarity between the feature words contained in a single sentence and a feature word chain is calculated as follows:
The target document d comprises a single-sentence set S and a feature word set W = {w_1, w_2, …, w_n}. According to whether a single sentence s_q in S contains each feature word of W, the relation of s_q to the feature words in W is obtained as the indicator vector
x_q = (x_q1, x_q2, …, x_qn)
wherein x_qi = 1 if s_q contains w_i, and x_qi = 0 otherwise.
The target document d comprises a feature word chain set C and the feature word set W = {w_1, w_2, …, w_n}. According to whether a word chain c_p in C contains each feature word of W, the relation of c_p to the feature words in W is obtained as the indicator vector
y_p = (y_p1, y_p2, …, y_pn)
wherein y_pi = 1 if c_p contains w_i, and y_pi = 0 otherwise.
The similarity of a single sentence s_q and a word chain c_p is then the cosine of the two indicator vectors:
Sim(s_q, c_p) = Σ_(i=1..n) x_qi·y_pi / (sqrt(Σ_(i=1..n) x_qi^2) × sqrt(Σ_(i=1..n) y_pi^2))
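The indicator vectors and the cosine similarity can be sketched as follows (all names illustrative):

```python
import math

def indicator(contained_words, feature_words):
    """x_q / y_p: component i is 1 iff the sentence (or chain) contains w_i."""
    present = set(contained_words)
    return [1 if w in present else 0 for w in feature_words]

def chain_similarity(sentence_words, chain_words, feature_words):
    """Cosine similarity of the two 0/1 indicator vectors over W."""
    x = indicator(sentence_words, feature_words)
    y = indicator(chain_words, feature_words)
    dot = sum(a * b for a, b in zip(x, y))
    nx, ny = sum(x), sum(y)   # squared norms of 0/1 vectors equal their sums
    return dot / math.sqrt(nx * ny) if nx and ny else 0.0
```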
In one embodiment, the abstract sentences of the topic evaluation summary are selected as follows:
The similarities between single sentences and word chains in the target document d are written as the set U = {u_11, u_12, …, u_fg}, where u_ij = Sim(s_i, c_j), f is the total number of single sentences in the target document d, and g is the total number of feature word chains.
The number of sentences allowed into the abstract is determined by:
N = f × R
where N denotes the limit on the number of abstract sentences and R denotes the compression ratio of the abstract, 0 < R < 1, with the value of R set by the user as desired. If the number N of abstract sentences to be extracted is less than the total number g of feature word chains, the single sentences with the highest similarity to the first N word chains in the word chain set C are selected from S as the abstract sentences. If N is greater than g, the single sentence with the highest similarity to each word chain in C is extracted from S in turn as an abstract sentence, and extraction then continues from the word chains with the highest single-sentence similarities until the required number of abstract sentences is reached.
The extraction procedure is as follows:

Input: feature word chain set C
single sentence set S
similarity set U
number N of summary sentences to be extracted
similarity threshold δ

Output: summary sentence set Y

The steps are:

(1) j = 1, r = 0
(2) Do
(3) select the word chain c_j of the feature word chain set C;
(4) select u_ij from U such that u_ij = max(u_1j, u_2j, …, u_fj), u_ij > δ, and s_i has not already been added to Y;
(5) if such a u_ij exists, add s_i to the summary sentence set Y and set r = 0; otherwise record an idle selection, r = r + 1;
(6) if j is less than the number g of chains in the set C, then j = j + 1, otherwise j = 1;
(7) Until the number of sentences in Y equals N, or r = g, i.e. no word chain satisfies the selection condition;
(8) when the algorithm ends, return the summary sentence set Y.
And finally, the summary sentences in Y are ordered according to their original sequence in the target document d to form a topic evaluation summary meeting the user's preference.
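The round-robin extraction procedure can be sketched as follows. Note one assumption: in the listed steps the idle counter r is reset whenever j advances, which would prevent the r = g termination test from ever firing; this sketch instead resets r only on a successful selection, which appears to be the intent (a full fruitless round over all g chains ends the loop):

```python
def select_summary(chains, sentences, U, N, delta):
    """Round-robin summary sentence selection: visit the feature word
    chains in turn and take the not-yet-chosen sentence with the highest
    similarity to the current chain, provided that similarity exceeds
    delta.  U[i][j] is the similarity between sentence i and chain j.
    r counts consecutive fruitless chains; r == g ends the loop early."""
    f, g = len(sentences), len(chains)
    Y, chosen = [], set()
    r, j = 0, 0
    while len(Y) < N and r < g:
        best_i, best_u = -1, delta
        for i in range(f):
            if i not in chosen and U[i][j] > best_u:
                best_i, best_u = i, U[i][j]
        if best_i >= 0:
            Y.append(sentences[best_i])
            chosen.add(best_i)
            r = 0                       # successful selection resets the idle count
        else:
            r += 1                      # no sentence qualified for this chain
        j = (j + 1) % g                 # move on to the next word chain
    return Y
```

For example, with three sentences, two chains, and U = [[0.9, 0.1], [0.2, 0.8], [0.5, 0.4]], the first round picks the best sentence for each chain, and a second pass over chain 1 picks the remaining qualifying sentence if N allows.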
The method is applicable not only to catering evaluation but also to fields such as hotels, tourism and online shopping, generating query-oriented automatic summaries from evaluation texts. In this embodiment, the similarity between a single sentence and a feature word chain may be calculated not only by the cosine similarity method but also by other similarity measures such as the Jaccard, Dice and Overlap formulas.
Example 2:
as shown in fig. 2, the present embodiment provides a method for automatically generating a topic evaluation summary based on user preferences, which includes the following steps:
step 101, collecting evaluation texts previously published on the internet by a customer as sample documents, as the basis for mining the customer's catering preferences;

step 102, preprocessing the sample documents, including but not limited to word segmentation and stop word removal;

step 103, calculating word pair co-occurrence rates from the preprocessed sample documents;

step 104, storing the word pair co-occurrence rates in a feature database;

step 105, when past dining evaluation texts of the customer are lacking, or the customer has other preference requirements, several feature keywords can be entered manually by the customer and stored in the feature database as pairwise associated co-occurrence word pairs;

step 106, collecting other customers' evaluation texts for restaurants within the customer's selection range (e.g., nearby geographic location, suitable price range), and aggregating them into one target document per restaurant;

step 107, preprocessing the target documents, including but not limited to word segmentation and stop word removal;

step 108, extracting the feature word set of each target document, looking up the association degree of each word in the feature word set in the feature database, and generating feature word chains;

step 109, dividing the target document into single sentences;

step 110, calculating the similarity between the feature words contained in each single sentence of the target document and each feature word chain, and, subject to the summary size limit, selecting in turn the single sentence with the highest similarity to each feature word chain to generate the summary.
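The preprocessing of steps 102 and 107 can be sketched minimally. Whitespace tokenization stands in for a real Chinese segmenter (such as jieba), and the stop word list is supplied by the caller; both are illustrative assumptions, not the patent's specified tooling:

```python
def preprocess(text, stopwords):
    """Minimal stand-in for steps 102/107: tokenize the evaluation text
    and drop stop words.  Whitespace splitting substitutes for a real
    word segmenter; real review text would need one (e.g. jieba)."""
    return [tok for tok in text.lower().split() if tok not in stopwords]

# preprocess("The soup was very Good", {"the", "was", "very"})
# returns ["soup", "good"]
```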
Embodiment 2 differs from embodiment 1 in that its specific scenario is a catering setting: the method collects evaluation texts previously published on the internet by a customer as sample documents, uses them as the basis for mining the customer's catering preferences, then analyzes the existing evaluation texts of other users for a restaurant according to the interest preferences exhibited in the customer's past evaluations, and automatically generates an evaluation summary of the restaurant.
It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and other modifications or equivalent substitutions made by the technical solutions of the present invention by those skilled in the art should be covered in the scope of the claims of the present invention as long as they do not depart from the spirit and scope of the technical solutions of the present invention.

Claims (6)

1. A method for automatically generating a theme evaluation abstract based on user preference is characterized in that: the method comprises the following steps:
step 101, collecting an evaluation text published on the internet by a user based on a specific scene in the past as a sample document;
step 102, preprocessing a sample document;
step 103, calculating word pair co-occurrence rate from the preprocessed sample document;
step 104, storing the word pair co-occurrence rate into a feature database;
step 105, when the past evaluation texts of the user are lacked or the user has other preference requirements, manually inputting a plurality of feature keywords by the user, and storing the keywords into a feature database as pairwise associated co-occurrence word pairs;
step 106, collecting evaluation texts of other corresponding users in the specific scene in accordance with the user selection range, and respectively summarizing to generate target documents;
step 107, preprocessing the target document;
step 108, extracting a feature word set of the target document, looking up the association degree of each word in the feature word set in the feature database, and generating feature word chains;
step 109, dividing a single sentence from the target document;
step 110, calculating the similarity between the feature words contained in each single sentence in the target document and each feature word chain, and selecting in turn, according to the similarity relation between the single sentences and each feature word chain, the single sentence with the highest similarity to each feature word chain to generate the summary.
2. The method for automatically generating a topic evaluation summary based on user preferences of claim 1, wherein: in step 103, the word pair co-occurrence rate in a single sample document d is specifically calculated as follows:

the word pair co-occurrence rate P_d(w_i, w_j) of any two words w_i and w_j in the sample document d is calculated by the following formula:
P_d(w_i, w_j) = Σ_{t∈T} N_t(w_i, w_j) / ( S_d(w_i) + S_d(w_j) − S_d(w_i, w_j) )

wherein T is the set of window units in the sample document d; W = {w_1, w_2, …, w_n} is the word set of the sample document d, w_i and w_j are any two words in the word set W, and i, j and n are positive integers; S_d(w_i) denotes the number of window units in the sample document d that contain w_i; S_d(w_j) denotes the number of window units in the sample document d that contain w_j; S_d(w_i, w_j) denotes the number of window units in the sample document d that contain both w_i and w_j; N_t(w_i, w_j) denotes the number of co-occurrences of w_i and w_j in a certain window unit t, t ∈ T, where, when the numbers of occurrences of w_i and w_j in the same window unit differ, the minimum of the two occurrence counts is taken as the co-occurrence count.
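A window-based co-occurrence rate along these lines can be sketched as follows. The formula image is lost in the source, so the normalization used here (summed per-window co-occurrence counts over the union of windows containing either word) is one plausible reading of the defined terms, not the patent's confirmed formula; the per-window count N_t does follow the stated minimum-of-occurrences rule:

```python
from collections import Counter

def cooccurrence_rate(windows, wi, wj):
    """Word pair co-occurrence rate over a document's window units.
    Each window is a list of tokens.  Per the patent text, the
    co-occurrence count N_t within one window is the minimum of the
    two words' occurrence counts; the normalization is an assumption."""
    s_i = s_j = s_ij = n_total = 0
    for window in windows:
        counts = Counter(window)
        has_i, has_j = counts[wi] > 0, counts[wj] > 0
        s_i += has_i                 # windows containing w_i
        s_j += has_j                 # windows containing w_j
        if has_i and has_j:
            s_ij += 1                # windows containing both
            n_total += min(counts[wi], counts[wj])   # N_t(w_i, w_j)
    union = s_i + s_j - s_ij         # windows containing either word
    return n_total / union if union else 0.0
```

For example, over the windows [["a","b","a"], ["a","c"], ["b","b"]], the pair ("a","b") co-occurs once in one of three relevant windows, giving a rate of 1/3.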
3. The method of claim 2, wherein: in step 103, the total word pair co-occurrence rate over k sample documents {d_1, d_2, …, d_k} is specifically calculated as follows:

P_k(w_i, w_j) = ( P_{k−1}(w_i, w_j) · S_{k−1}(w_i, w_j) + P_{d_k}(w_i, w_j) · S_{d_k}(w_i, w_j) ) / ( S_{k−1}(w_i, w_j) + S_{d_k}(w_i, w_j) )

wherein k is a positive integer greater than 1; P_{k−1}(w_i, w_j) is the total word pair co-occurrence rate of the words w_i and w_j in the first k−1 sample documents {d_1, d_2, …, d_{k−1}}; S_{k−1}(w_i, w_j) is the total number of window units in the first k−1 sample documents {d_1, d_2, …, d_{k−1}} in which the word pair w_i and w_j co-occurs; P_{d_k}(w_i, w_j) is the word pair co-occurrence rate of w_i and w_j in the kth sample document d_k; S_{d_k}(w_i, w_j) denotes the number of window units in the kth sample document d_k that contain both w_i and w_j; D is the sample document space {d_1, d_2, …, d_k}.
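The incremental update this claim describes — folding a new document's co-occurrence rate into the running total — can be sketched as a window-count-weighted average. The exact weighting is an assumption reconstructed from the terms the claim defines, since the formula images are lost:

```python
def update_total_rate(p_prev, s_prev, p_k, s_k):
    """Fold sample document d_k into the running totals.
    p_prev, s_prev: total co-occurrence rate and total number of
        co-occurring window units over the first k-1 documents.
    p_k, s_k: the same two quantities for document d_k alone.
    Returns the updated (rate, window count) pair."""
    total_windows = s_prev + s_k
    if total_windows == 0:
        return 0.0, 0
    p_total = (p_prev * s_prev + p_k * s_k) / total_windows
    return p_total, total_windows
```

Because only the previous rate and window count are kept, the feature database never has to re-scan earlier sample documents when a new one arrives.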
4. A method for automatically generating a topic evaluation summary based on user preferences according to claim 1, 2 or 3, wherein: the feature database is composed of feature words and co-occurrence word pairs.
5. The method of claim 4 for automatically generating a topic evaluation summary based on user preferences, wherein: looking up the association degree of each word of the feature word set in the feature database and generating feature word chains comprises: for a target document d from which the summary is to be generated, looking up the association degree of each word of the feature word set in the feature database and in the target document d, and generating feature word chains, with the following specific steps:

the feature database stores a feature word set V = {v_1, v_2, …, v_m} and a co-occurrence word pair set E = {e(v_i, v_j)}; any co-occurrence word pair e(v_i, v_j) in E is a Word Co-occurrence structure comprising the numbers, the co-occurrence rate and the co-occurrence count of the two words;

the target document d contains a feature word set W = {w_1, w_2, …, w_n}; for any word pair (w_i, w_j) in W, if w_i ∈ V and w_j ∈ V and the word pair is contained in E, i.e. e(w_i, w_j) exists, then the association degree l_ij of the two words is the co-occurrence rate recorded in e(w_i, w_j); if w_i ∉ V or w_j ∉ V, the association degree of the two words is l_ij = P_d(w_i, w_j); in this way, the association degree set L = {l_11, l_12, …, l_1n, l_21, …, l_2n, …, l_(n−1)n} of all word pairs in the feature word set W of the target document d is generated;

according to the association degrees between word pairs, a plurality of feature word chains are constructed by a clustering method, thereby exhibiting a plurality of related topics of the target document d; the feature word chains are collected, in the order in which they are generated during clustering, into the feature word chain set C.
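The claim leaves the clustering method open. One simple choice is greedy single-link clustering over the association degrees; the threshold and the sample words below are invented for illustration:

```python
def build_word_chains(words, assoc, threshold):
    """Greedy single-link clustering sketch: a word joins the first
    existing chain containing a member whose association degree with it
    exceeds the threshold, otherwise it starts a new chain.
    assoc maps (a, b) pairs to the association degree l_ij (symmetric)."""
    def degree(a, b):
        return assoc.get((a, b), assoc.get((b, a), 0.0))

    chains = []
    for w in words:
        for chain in chains:
            if any(degree(w, m) > threshold for m in chain):
                chain.append(w)
                break
        else:
            chains.append([w])       # no chain accepted w: start a new one
    return chains
```

With assoc = {("taste", "flavor"): 0.8, ("taste", "price"): 0.1} and a threshold of 0.5, the words ["taste", "flavor", "price"] cluster into the two chains [["taste", "flavor"], ["price"]], each chain representing one topic of the target document.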
6. The method for automatically generating a topic evaluation summary based on user preferences of claim 5, wherein: the similarity between the feature words contained in a single sentence and a feature word chain is calculated through the following specific steps:

the target document d comprises a single sentence set S and a feature word set W = {w_1, w_2, …, w_n}; according to whether a single sentence s_q in the single sentence set S contains each feature word in W, the relation of s_q to the feature words in W is obtained as the binary vector

s_q = (x_q1, x_q2, …, x_qn)

wherein:

x_qi = 1 if the feature word w_i occurs in s_q, and x_qi = 0 otherwise;

the target document d comprises a feature word chain set C and the feature word set W = {w_1, w_2, …, w_n}; according to whether a word chain c_p in the word chain set C contains each feature word in W, the relation of c_p to the feature words in W is obtained as the binary vector

c_p = (y_p1, y_p2, …, y_pn)

wherein:

y_pi = 1 if the feature word w_i occurs in c_p, and y_pi = 0 otherwise;

the similarity of the single sentence s_q and the word chain c_p is calculated by the cosine formula:

Sim(s_q, c_p) = Σ_i (x_qi · y_pi) / ( √(Σ_i x_qi²) · √(Σ_i y_pi²) )
CN202010022473.7A 2020-01-09 2020-01-09 Method for automatically generating theme evaluation abstract based on user preference Active CN111259136B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010022473.7A CN111259136B (en) 2020-01-09 2020-01-09 Method for automatically generating theme evaluation abstract based on user preference


Publications (2)

Publication Number Publication Date
CN111259136A true CN111259136A (en) 2020-06-09
CN111259136B CN111259136B (en) 2024-03-22

Family

ID=70954087

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010022473.7A Active CN111259136B (en) 2020-01-09 2020-01-09 Method for automatically generating theme evaluation abstract based on user preference

Country Status (1)

Country Link
CN (1) CN111259136B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420142A (en) * 2021-05-08 2021-09-21 广东恒宇信息科技有限公司 Personalized automatic abstract algorithm
CN113918708A (en) * 2021-12-15 2022-01-11 深圳市迪博企业风险管理技术有限公司 Abstract extraction method

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006215850A (en) * 2005-02-04 2006-08-17 Nippon Telegr & Teleph Corp <Ntt> Apparatus and method for creating concept information database, program, and recording medium
CN106156272A (en) * 2016-06-21 2016-11-23 北京工业大学 A kind of information retrieval method based on multi-source semantic analysis


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Wang Lei; Huang Guangjun: "Research on Semantic Expansion Technology Combined with Concept Semantic Space", Computer Engineering and Applications, no. 35 *
Deng Zhen; Bao Hong: "Research on Multi-Document Automatic Summarization Based on Lexical Chains", Computers and Applied Chemistry, no. 11 *


Also Published As

Publication number Publication date
CN111259136B (en) 2024-03-22

Similar Documents

Publication Publication Date Title
Al-Radaideh et al. A hybrid approach for arabic text summarization using domain knowledge and genetic algorithms
Deshpande et al. Building, maintaining, and using knowledge bases: a report from the trenches
Zhao et al. Topical keyphrase extraction from twitter
US9679001B2 (en) Consensus search device and method
CN109960756B (en) News event information induction method
US20120029908A1 (en) Information processing device, related sentence providing method, and program
CN110083696B (en) Global citation recommendation method and system based on meta-structure technology
US8812504B2 (en) Keyword presentation apparatus and method
Di Fabbrizio et al. Summarizing online reviews using aspect rating distributions and language modeling
Al-Taani et al. An extractive graph-based Arabic text summarization approach
CN110750995A (en) File management method based on user-defined map
CN110888991A (en) Sectional semantic annotation method in weak annotation environment
Dorji et al. Extraction, selection and ranking of Field Association (FA) Terms from domain-specific corpora for building a comprehensive FA terms dictionary
Lin et al. A simple but effective method for Indonesian automatic text summarisation
Kisilevich et al. “Beautiful picture of an ugly place”. Exploring photo collections using opinion and sentiment analysis of user comments
CN111259136B (en) Method for automatically generating theme evaluation abstract based on user preference
Gupta A survey of text summarizers for Indian Languages and comparison of their performance
Yu et al. Role-explicit query identification and intent role annotation
Balasubramanian et al. Topic pages: An alternative to the ten blue links
Haubold et al. Web-based information content and its application to concept-based video retrieval
KR102275095B1 (en) The informatization method for youtube video metadata for personal media production
Charnine et al. Association-Based Identification of Internet Users Interest
Almasian et al. Qfinder: A framework for quantity-centric ranking
Li et al. CLC-RS: a Chinese legal case retrieval system with masked language ranking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant