Disclosure of Invention
Technical problem to be solved
In order to solve the problem that existing topic meaning extraction algorithms are inaccurate because of word ambiguity, the invention provides a method that extracts topic meaning using word senses instead of words, which improves the precision of topic meaning extraction.
Technical scheme
The basic idea of the invention is: convert the candidate subject words in a text document into candidate topic word senses, then extract the candidate topic word senses, and finally output the topic word senses. Throughout the process, word senses rather than words are used to extract the topic meaning, so the inaccuracy caused by word ambiguity is avoided both in expressing the topic meaning and in the algorithm's processing.
The invention is technically characterized in that: the candidate subject word is converted into a candidate topic word sense for subsequent processing by applying a disambiguation technique to the context information of the candidate subject word; and in the process of extracting the topic meaning, statistical information and semantic information are considered simultaneously, so the precision of topic meaning extraction is improved.
A method for automatically extracting the theme word meaning of a text document is characterized by comprising the following steps:
(1) respectively preprocessing each text document in the training text document set and the test text document set to obtain a candidate topic word meaning set of each text document;
the preprocessing comprises the following steps:
step a: extracting a candidate subject term set of the text document:
firstly, removing the numbers and punctuation marks in the text document and segmenting the text document into a set of words;
then, removing the words in the set that do not meet the conditions below;
finally, converting the capital letters in the remaining words into lower-case letters and removing the prefixes and suffixes of the words, to obtain the candidate subject word set of the text document;
the conditions are as follows: the number of letters forming the word is less than a preset value, the word contains at least one lower-case letter, and the word is a non-stop word; the non-stop words are all words other than stop words, and the stop words are function words; the preset value is 15 letters;
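The filtering rules above (length limit, lower-case requirement, stop-word removal, lower-casing and suffix stripping) can be sketched as follows; the stop-word list and the trailing-"s" stripping are simplified stand-ins for a real stop list and stemmer, not part of the method as claimed:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "it", "is"}  # illustrative subset
MAX_LEN = 15  # the preset value named in the method

def candidate_subject_words(text):
    """Return the candidate subject word set of a text document."""
    # Remove digits and punctuation, then segment into words.
    words = re.sub(r"[^A-Za-z\s]", " ", text).split()
    kept = []
    for w in words:
        if len(w) > MAX_LEN:         # too many letters
            continue
        if w.isupper():              # no lower-case letter (e.g. acronyms)
            continue
        if w.lower() in STOP_WORDS:  # stop word (function word)
            continue
        kept.append(w.lower())       # convert capitals to lower case
    # Naive suffix stripping stands in for the stemming/root-form step
    # (a real system would use e.g. a Porter stemmer).
    return {w[:-1] if w.endswith("s") else w for w in kept}

print(candidate_subject_words("The NASA satellites orbit Earth in 2024."))
```

With this toy input, "The"/"in" are dropped as stop words, "NASA" is dropped for having no lower-case letter, and "satellites" is reduced to its root form.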
step b: adopting a disambiguation algorithm to obtain a candidate subject word sense set of the text document:
firstly, for each candidate subject word in the candidate subject word set, selecting the words within a range of W around it as the context of the candidate subject word; the value of W ranges over [6, 10];
then, the semantic relatedness rel(s_k, c_i) between the kth possible word sense s_k of each candidate subject word and the ith context c_i of the candidate subject word is computed according to the semantic relatedness formula

rel(s_k, c_i) = 2 × NumOfOverlaps_{s_k c_i} / (wordNumInGlossOf_{s_k} + wordNumInGlossOf_{c_i})

and the total semantic relatedness SenseScore(s_k) of the kth possible word sense s_k of the candidate subject word with all contexts of the candidate subject word is computed as

SenseScore(s_k) = Σ_{i=1}^{I} rel(s_k, c_i)

wherein k = 1, 2, …, K, and K is the number of possible word senses of the candidate subject word; i = 1, 2, …, I, and I is the number of contexts of the candidate subject word; wordNumInGlossOf_{s_k} denotes the number of words contained in the WordNet gloss of s_k, wordNumInGlossOf_{c_i} denotes the number of words contained in the WordNet gloss of c_i, and NumOfOverlaps_{s_k c_i} denotes the number of identical words between the words contained in the WordNet gloss of s_k and those contained in the WordNet gloss of c_i; the possible word senses are those defined in the lexical database WordNet;
finally, selecting the possible word sense with the maximum total semantic relevance SenseScore value as a candidate subject word sense of the candidate subject word to obtain a candidate subject word sense set of the text document;
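The disambiguation step can be illustrated with a small sketch. The text names the gloss-length and overlap-count variables but the exact relatedness formula is garbled here, so a Dice-style normalized gloss overlap (2 × overlaps / sum of gloss word counts, bounded in [0, 1]) is assumed; the glosses below are toy examples, not real WordNet entries:

```python
def rel(gloss_a, gloss_b):
    """Assumed Dice-style normalized gloss overlap between two glosses."""
    words_a, words_b = gloss_a.split(), gloss_b.split()
    overlaps = len(set(words_a) & set(words_b))
    return 2.0 * overlaps / (len(words_a) + len(words_b))

def best_sense(candidate_senses, context_glosses):
    """Pick the sense with maximal SenseScore, i.e. the sum of rel
    values with every context gloss.  candidate_senses: {sense: gloss}."""
    def sense_score(gloss):
        return sum(rel(gloss, cg) for cg in context_glosses)
    return max(candidate_senses, key=lambda s: sense_score(candidate_senses[s]))

# Toy glosses (hypothetical, for illustration only):
senses = {
    "bank#1": "sloping land beside a body of water",
    "bank#2": "financial institution that accepts deposits",
}
context = ["body of water that flows", "land along a river"]
print(best_sense(senses, context))  # -> bank#1: its gloss overlaps the context
```

Here "bank#1" wins because its gloss shares "body", "of", "water" and "land" with the context glosses, while "bank#2" shares almost nothing.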
step c: merging candidate topic senses:
the semantic relatedness rel(cs_p, cs_q) between any two candidate topic word senses cs_p and cs_q in the candidate topic word sense set is computed according to the semantic relatedness formula

rel(cs_p, cs_q) = 2 × NumOfOverlaps_{cs_p cs_q} / (wordNumInGlossOf_{cs_p} + wordNumInGlossOf_{cs_q})

and either one of any two candidate topic word senses whose semantic relatedness value is larger than a given threshold λ is removed; the value of the threshold λ ranges over [0.5, 0.8];
wherein p, q = 1, 2, …, P, p ≠ q, and P is the number of candidate topic word senses in the candidate topic word sense set; wordNumInGlossOf_{cs_p} denotes the number of words contained in the WordNet gloss of cs_p, wordNumInGlossOf_{cs_q} denotes the number of words contained in the WordNet gloss of cs_q, and NumOfOverlaps_{cs_p cs_q} denotes the number of identical words between the words contained in the WordNet glosses of cs_p and cs_q;
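A minimal sketch of the merging step, under the same assumed Dice-style gloss-overlap relatedness; λ = 0.65 is an illustrative value from the stated range [0.5, 0.8], and the glosses are toy data:

```python
def rel(gloss_a, gloss_b):
    # Assumed Dice-style gloss overlap, as in the disambiguation step.
    a, b = gloss_a.split(), gloss_b.split()
    return 2.0 * len(set(a) & set(b)) / (len(a) + len(b))

def merge_senses(sense_ids, glosses, lam=0.65):
    """Greedy pass: keep a candidate topic word sense only if it is not
    too close (rel > lambda) to any sense already kept."""
    kept = []
    for s in sense_ids:
        if all(rel(glosses[s], glosses[k]) <= lam for k in kept):
            kept.append(s)
    return kept

glosses = {
    "car#1": "a motor vehicle with four wheels",
    "auto#1": "a motor vehicle with four wheels",  # duplicate meaning
    "road#1": "an open way for travel or transport",
}
print(merge_senses(list(glosses), glosses))  # -> ['car#1', 'road#1']
```

The duplicate sense "auto#1" has relatedness 1.0 to "car#1", exceeding λ, so one of the pair is removed, exactly as the merging rule prescribes.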
(2) calculating the characteristic attribute value of each candidate topic word sense in the candidate topic word sense set in the text document; the characteristic attributes comprise: the frequency tf x idf of occurrence of the candidate topic word sense in the text document, the average position fo of the first occurrence of the candidate topic word sense in the text document, the number of letters len contained in the candidate topic word sense, and the cohesiveness coh between the candidate topic word senses;
the calculation formula of the frequency tf multiplied by idf of the candidate topic word senses appearing in the text document is as follows:
tf×idf(cs_j) = f(cs_j) × log( |D| / |D(cs_j)| )
wherein cs_j is the jth candidate topic word sense in the candidate topic word sense set of the text document, j = 1, 2, …, J, and J is the number of candidate topic word senses in the candidate topic word sense set of the text document; f(cs_j) is the number of occurrences of cs_j in the text document; D denotes a set of text documents; |D| is the number of text documents in D; |D(cs_j)| is the number of text documents in D that contain the candidate topic word sense cs_j;
the calculation formula of the average position fo of the candidate topic word senses appearing for the first time in the text document is as follows:
fo(cs_j) = O_first / J
wherein O_first is the position of the first occurrence of the candidate topic word sense cs_j in the text document;
the calculation formula of the cohesiveness coh among the candidate topic word senses is as follows:
coh(cs_j) = ( Σ_{l=1, l≠j}^{J} rel(cs_j, cs_l) ) / (J − 1)
wherein rel(cs_j, cs_l) is the semantic relatedness between the candidate topic word senses cs_j and cs_l in the candidate topic word sense set of the text document, computed according to the semantic relatedness formula; wordNumInGlossOf_{cs_j} denotes the number of words contained in the WordNet gloss of cs_j, wordNumInGlossOf_{cs_l} denotes the number of words contained in the WordNet gloss of cs_l, and NumOfOverlaps_{cs_j cs_l} denotes the number of identical words between the words contained in the WordNet glosses of cs_j and cs_l;
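The feature attributes of step (2) can be sketched as follows, treating a document as the list of its candidate topic word senses. The `rel` function is passed in as a parameter because its exact form is given by the semantic relatedness formula; the toy corpus and the constant placeholder relatedness are illustrative only:

```python
import math

def tf_idf(sense, doc, corpus):
    """tf x idf of a candidate topic word sense: occurrence count times
    log of (corpus size / number of documents containing the sense)."""
    f = doc.count(sense)
    containing = sum(1 for d in corpus if sense in d)
    return f * math.log(len(corpus) / containing)

def first_occurrence(sense, doc):
    """fo: position of the first occurrence divided by the number of
    candidate topic word senses in the document."""
    return doc.index(sense) / len(doc)

def cohesiveness(sense, doc, rel):
    """coh: mean relatedness of the sense to every other candidate
    topic word sense in the document."""
    others = [s for s in doc if s != sense]
    return sum(rel(sense, s) for s in others) / (len(doc) - 1)

corpus = [
    ["sense_a", "sense_b", "sense_c"],
    ["sense_b", "sense_c"],
    ["sense_c"],
]
doc = corpus[0]
toy_rel = lambda a, b: 0.5  # placeholder relatedness for illustration
print(round(tf_idf("sense_a", doc, corpus), 3))  # f = 1, log(3/1)
print(first_occurrence("sense_b", doc))
print(cohesiveness("sense_a", doc, toy_rel))
```

The fourth feature, len, is simply the number of letters in the candidate topic word sense (`len("sense_a")` in Python terms) and needs no helper.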
(3) extracting a final theme word meaning set of each text document in the test text document set by using a Bayesian model:
firstly, calculating the probability Pr that each candidate topic word sense of each text document in the test text document set is a topic word sense, according to Pr = Pr[T|yes] × Pr[O|yes] × Pr[L|yes] × Pr[C|yes] × Pr[yes];
wherein Pr[T|yes], Pr[O|yes], Pr[L|yes] and Pr[C|yes] respectively represent the probability that a candidate topic word sense is a topic word sense conditioned on its tf×idf, fo, len and coh feature attribute values, and Pr[yes] represents the ratio of the number of text documents in the training text document set for which the candidate topic word sense is a topic word sense to the number of text documents in the training text document set for which it is not;
then, sequencing all candidate subject word senses of the text document from large to small according to the probability Pr value;
finally, the user-set number of candidate topic word senses ranked first are selected to form the final topic word sense set for the text document.
The calculation formulas of Pr [ T | yes ], Pr [ O | yes ], Pr [ L | yes ], Pr [ C | yes ] and Pr [ yes ] are respectively:
Pr[T|yes] = \overline{tf×idf}^1(cs'_m) / tf×idf^{d'}(cs'_m)
Pr[O|yes] = \overline{fo}^1(cs'_m) / fo^{d'}(cs'_m)
Pr[L|yes] = \overline{len}^1(cs'_m) / len^{d'}(cs'_m)
Pr[C|yes] = \overline{coh}^1(cs'_m) / coh^{d'}(cs'_m)
wherein d' is a text document in the test text document set; cs'_m is the mth candidate topic word sense of the text document d', m = 1, 2, …, M, and M is the number of candidate topic word senses in the candidate topic word sense set of d'; tf×idf^{d'}(cs'_m), fo^{d'}(cs'_m), len^{d'}(cs'_m) and coh^{d'}(cs'_m) are respectively the tf×idf, fo, len and coh feature attribute values of the candidate topic word sense cs'_m in the text document d'; \overline{tf×idf}^1(cs'_m), \overline{fo}^1(cs'_m), \overline{len}^1(cs'_m) and \overline{coh}^1(cs'_m) are respectively the average tf×idf, fo, len and coh feature attribute values of cs'_m in the text document set T^1; the text document set T^1 is the set of text documents in the training text document set for which the candidate topic word sense cs'_m is a topic word sense; the text document set T^0 is the set of text documents in the training text document set for which the candidate topic word sense cs'_m is not a topic word sense;
the calculation formulas of \overline{tf×idf}^1(cs'_m), \overline{fo}^1(cs'_m), \overline{len}^1(cs'_m) and \overline{coh}^1(cs'_m) are respectively:
\overline{tf×idf}^1(cs'_m) = ( Σ_{n=1}^{|T^1|} tf×idf_n^1(cs'_m) ) / |T^1|
\overline{fo}^1(cs'_m) = ( Σ_{n=1}^{|T^1|} fo_n^1(cs'_m) ) / |T^1|
\overline{len}^1(cs'_m) = ( Σ_{n=1}^{|T^1|} len_n^1(cs'_m) ) / |T^1|
\overline{coh}^1(cs'_m) = ( Σ_{n=1}^{|T^1|} coh_n^1(cs'_m) ) / |T^1|
wherein tf×idf_n^1(cs'_m), fo_n^1(cs'_m), len_n^1(cs'_m) and coh_n^1(cs'_m) are respectively the tf×idf, fo, len and coh feature attribute values of the candidate topic word sense cs'_m in the nth text document of the text document set T^1.
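The Bayesian scoring and top-N selection can be sketched as below. Each conditional probability is modeled as a ratio between the candidate's feature attribute value in the document and its average over the T^1 training documents; the dictionary keys and the numeric values are illustrative assumptions, not data from the method:

```python
def bayes_score(feats, avg_T1, pr_yes):
    """Pr = Pr[T|yes] x Pr[O|yes] x Pr[L|yes] x Pr[C|yes] x Pr[yes],
    with each conditional taken as the document feature value divided
    by its average over the T^1 documents (ratio-based estimate)."""
    pr = pr_yes
    for name in ("tfidf", "fo", "len", "coh"):
        pr *= feats[name] / avg_T1[name]
    return pr

def top_n(candidates, n):
    """candidates: {sense: Pr}; return the n senses with largest Pr,
    sorted from large to small as the method prescribes."""
    return sorted(candidates, key=candidates.get, reverse=True)[:n]

# Illustrative feature values for one candidate topic word sense:
feats = {"tfidf": 2.0, "fo": 0.5, "len": 6, "coh": 0.4}
avg = {"tfidf": 1.0, "fo": 0.25, "len": 6, "coh": 0.2}
print(bayes_score(feats, avg, pr_yes=0.3))
print(top_n({"a": 0.3, "b": 0.1, "c": 0.5}, 2))  # -> ['c', 'a']
```

Because every candidate is scored with the same product form, the absolute value of Pr matters less than the ordering it induces, which is why only the top-ranked candidates are kept.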
Beneficial effects:
the invention provides an automatic extraction method of text document theme word senses, which uses word senses to replace words for processing, and solves the problems of inaccurate expression of theme meaning and misoperation in the extraction process caused by word ambiguity, thereby improving the accuracy of an algorithm. In addition, in the extraction process, the invention considers statistical information (Bayesian estimation probability) and semantic information (word meaning) at the same time, thereby further improving the accuracy of the algorithm.
Detailed Description
Given a training text document set T = {t_1, …, t_|T|} and a set of text documents to be extracted (the test text document set) E = {e_1, …, e_|E|}, each text document in T and E is processed according to steps one and two below, specifically:
Step one: preprocess the text document. For a text document t_i in T (i = 1, …, |T|, where |T| is the number of text documents in the text document set T), first use step 1.1 to obtain the candidate subject words of the text document, then use step 1.2 to obtain the candidate topic word senses, and finally use step 1.3 to merge the candidate topic word senses, obtaining the final candidate topic word sense set of the text document t_i.
Step 1.1: and acquiring candidate subject terms. First, the text document t is removediThe numbers and various punctuation marks in (a), represent the text document as a collection of words: t is ti={w1,…,wij… }; then, for each word w in the set of wordsijThe invention adopts the following rules to judge whether the candidate subject term is a candidate subject term: if the composition wijIs greater than a predetermined value L (here, L ═ 15), or constitutes wijAll the letters of (A) are capital letters, or wijFor stop words (i.e. imaginary words including articles, pronouns, etc.), then wijCan not be candidate subject words, willIt is from the set w1,…,wij… } is removed; finally, set { w1,…,wij…, and removing the suffix from the words, i.e. each candidate subject word is represented in root form, to obtain the text document tiIs selected from the candidate subject word set CWi={cw1,…,cwij,…}。
Step 1.2: candidate topic word senses are obtained. For text documents tiIs selected from the candidate subject word set CWi={cw1,…,cwij… } of the candidate subject word cwij(j=1,…,|CWi|,|CWiI is candidate subject word set CWiThe number of candidate subject terms in the text document t) is obtained by the disambiguation algorithm in the inventioniThe correct sense in (1).
First, in CW_i, all words within the distance range W of cw_ij are selected as its context, giving the context set C_ij = {c_ij1, …, c_ijl, …} of cw_ij (|C_ij| is the number of words in the context set C_ij); S_ij = {s_ij1, …, s_ijk, …} is the set of possible word senses of the candidate subject word cw_ij, where |S_ij| is the number of possible word senses in the set S_ij, and a possible word sense is a word sense of the candidate subject word defined in the lexical database WordNet. Then, the semantic relatedness rel(s_ijk, c_ijl) between the kth possible word sense s_ijk of the candidate subject word cw_ij and its lth context word c_ijl is calculated as follows:

rel(s_ijk, c_ijl) = 2 × NumOfOverlaps_{s_ijk c_ijl} / (wordNumInGlossOf_{s_ijk} + wordNumInGlossOf_{c_ijl})    (1)
wherein wordNumInGlossOf_{s_ijk} denotes the number of words contained in the WordNet gloss of s_ijk, wordNumInGlossOf_{c_ijl} denotes the number of words contained in the WordNet gloss of c_ijl, and NumOfOverlaps_{s_ijk c_ijl} denotes the number of identical words between the words contained in the WordNet glosses of s_ijk and c_ijl;
then, the total semantic relatedness SenseScore(s_ijk) of each possible word sense s_ijk with all the context words c_ijl (l = 1, …, |C_ij|) in the context set is obtained according to the following formula:
SenseScore(s_ijk) = Σ_{l=1}^{|C_ij|} rel(s_ijk, c_ijl)    (2)
Finally, the possible word sense with the largest total semantic relatedness SenseScore value is selected as the correct word sense of the candidate subject word cw_ij, i.e., the candidate topic word sense of cw_ij.
The above method is used to compute the candidate topic word sense of every candidate subject word cw_ij (j = 1, 2, …, |CW_i|) in the candidate subject word set CW_i = {cw_1, …, cw_ij, …} of the text document t_i, forming the candidate topic word sense set of the text document t_i, recorded as CS_i = {cs_i1, …, cs_im, …}, where |CS_i| is the number of candidate topic word senses in the set CS_i.
Step 1.3: merging candidate topic senses. For a set of candidate topic word senses CSiAny two candidate topic senses cs inipAnd csiq(p,q=1,2,…,|CSiL, p ≠ q), and the semantic relevance rel (cs) between the l, p ≠ q is calculated according to the formula (3)ip csiq) If rel (cs)ip,csiq) Lambda (lambda is a given threshold), the corresponding candidate subject word cw is consideredipAnd cwiqAre semantically identical, will csipAnd csiqAs identical candidate topic word senses, i.e. in the set of candidate topic word senses CSiMiddle deletion csipOr csiq。
wherein wordNumInGlossOf_{cs_ip} denotes the number of words contained in the WordNet gloss of cs_ip, wordNumInGlossOf_{cs_iq} denotes the number of words contained in the WordNet gloss of cs_iq, and NumOfOverlaps_{cs_ip cs_iq} denotes the number of identical words between the words contained in the WordNet glosses of cs_ip and cs_iq.
Step two: and calculating the characteristic attribute. For the text document t obtained in the step oneiOf a set of candidate topic word senses CSiEach candidate topic sense cs inim(m=1,2,…,|CSi|,|CSiL is CSiThe number of the candidate topic word senses) respectively calculate four characteristic attribute values thereof, namely the frequency tf x idf of the candidate topic word sense appearing in the text document, the average position fo of the candidate topic word sense appearing in the text document for the first time, the cohesion coh between the letter number len contained in the candidate topic word sense and the candidate topic word sense, and the candidate topic word sense csimThe specific calculation formula of the tf × idf, fo and coh attribute values is as follows:
tf×idf(cs_im) = f(cs_im) × log( |T| / |T(cs_im)| )    (4)

fo(cs_im) = O_first / |CS_i|    (5)

coh(cs_im) = ( Σ_{p=1, p≠m}^{|CS_i|} rel(cs_im, cs_ip) ) / (|CS_i| − 1)    (6)
wherein f(cs_im) is the number of occurrences of the candidate topic word sense cs_im in the text document t_i; |T| is the number of text documents in the text document set T; |T(cs_im)| is the number of text documents in T that contain the candidate topic word sense cs_im; O_first is the position of the first occurrence of the candidate topic word sense cs_im in the text document t_i; rel(cs_im, cs_ip) is the semantic relatedness between candidate topic word senses calculated by formula (3).
Each text document in the text document set to be extracted E = {e_1, …, e_i, …, e_|E|} (i.e., the test text document set) is likewise processed by steps one and two above, which for each text document e_i in E yields its candidate topic word sense set {ce_i1, …, ce_ij, …} and the four feature attribute values of each candidate topic word sense ce_ij therein: tf×idf(ce_ij), fo(ce_ij), len(ce_ij) and coh(ce_ij). The Bayesian estimation method of step three below is then used to extract the topic word senses of the test text document set. Specifically:
Step three: extract the topic word senses. Since the topic word senses of the training text document set are known, for each candidate topic word sense ce_ij of a text document e_i in the test text document set E, the training text document set T is first divided into two classes according to whether ce_ij is a topic word sense of each training text document: for a text document t_i in the training text document set T, if the candidate topic word sense ce_ij is a topic word sense of t_i, the text document t_i falls into the first class of text documents, the set T^1; if the candidate topic word sense ce_ij is not a topic word sense of t_i, the text document t_i falls into the second class, the set T^0. Then, the average attribute values of ce_ij over the set T^1 are calculated by the following formulas:
\overline{tf×idf}^1(ce_ij) = ( Σ_{u=1}^{|T^1|} tf×idf_u^1(ce_ij) ) / |T^1|    (7)

\overline{fo}^1(ce_ij) = ( Σ_{u=1}^{|T^1|} fo_u^1(ce_ij) ) / |T^1|    (8)

\overline{len}^1(ce_ij) = ( Σ_{u=1}^{|T^1|} len_u^1(ce_ij) ) / |T^1|    (9)

\overline{coh}^1(ce_ij) = ( Σ_{u=1}^{|T^1|} coh_u^1(ce_ij) ) / |T^1|    (10)
wherein tf×idf_u^1(ce_ij), fo_u^1(ce_ij), len_u^1(ce_ij) and coh_u^1(ce_ij) are respectively the tf×idf, fo, len and coh attribute values of ce_ij in the uth text document t_u of the set T^1;
finally, the probability Pr that the candidate topic word sense ce_ij is a final topic word sense of the text document e_i is calculated according to the following formula:
Pr=Pr[T|yes]×Pr[O|yes]×Pr[L|yes]×Pr[C|yes]×Pr[yes] (11)
wherein Pr[T|yes], Pr[O|yes], Pr[L|yes] and Pr[C|yes] respectively represent the Bayesian estimation probability that the candidate topic word sense ce_ij of the text document e_i in the test text document set E is a topic word sense conditioned on its current feature attribute values tf×idf, fo, len and coh, and Pr[yes] represents the ratio of the number of text documents in the training text document set for which the candidate topic word sense is a topic word sense to the number of text documents in the training text document set for which it is not;
the calculation formulas of Pr [ T | yes ], Pr [ O | yes ], Pr [ L | yes ], Pr [ C | yes ] and Pr [ yes ] are respectively:
Pr[T|yes] = tf×idf^{e_i}(ce_ij) / \overline{tf×idf}^1(ce_ij)    (12)

Pr[O|yes] = fo^{e_i}(ce_ij) / \overline{fo}^1(ce_ij)    (13)

Pr[L|yes] = len^{e_i}(ce_ij) / \overline{len}^1(ce_ij)    (14)

Pr[C|yes] = coh^{e_i}(ce_ij) / \overline{coh}^1(ce_ij)    (15)

Pr[yes] = |T^1| / |T^0|    (16)
wherein tf×idf^{e_i}(ce_ij), fo^{e_i}(ce_ij), len^{e_i}(ce_ij) and coh^{e_i}(ce_ij) are respectively the tf×idf, fo, len and coh attribute values of ce_ij in the text document e_i of the test text document set E; |T^1| and |T^0| are respectively the numbers of text documents contained in the sets T^1 and T^0.
Using the above method, the probability Pr of becoming a final topic word sense is calculated for all candidate topic word senses in the candidate topic word sense set of each text document e_i in the text document set to be extracted (i.e., the test document set); the candidate topic word senses are sorted by Pr value from largest to smallest, and the top N candidate topic word senses are taken as required as the extracted topic word senses of the text document e_i.
Example experiment: the invention was implemented as a Java program, and a set of experiments was then performed to evaluate it; in these experiments the threshold λ was set to 0.9. The experimental data are 500 text documents containing subject words, randomly downloaded from an online text document database maintained by the UN Food and Agriculture Organization. These text documents contain 4.95 subject words on average. 300 text documents were used to train the model, and the other 200 text documents were used for testing.
Precision, Recall and the combined F-measure are used to evaluate the topic word sense extraction algorithm:

Precision = correct_extracted_keywords / all_extracted_keywords    (17)

Recall = correct_extracted_keywords / manual_assigned_keywords    (18)

F-measure = 2 × Precision × Recall / (Precision + Recall)    (19)

wherein correct_extracted_keywords is the number of correctly extracted topic word senses, all_extracted_keywords is the number of all extracted topic word senses, and manual_assigned_keywords is the number of manually assigned topic word senses.
Equations (17), (18) and (19) are used to evaluate each text document, and the final Precision, Recall and F-measure are the averages over the entire test text document set.
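The evaluation metrics can be computed directly; the counts below are illustrative stand-ins for one document's results, not data from the experiment:

```python
def precision(correct_extracted, all_extracted):
    """Fraction of extracted topic word senses that are correct."""
    return correct_extracted / all_extracted

def recall(correct_extracted, manual_assigned):
    """Fraction of manually assigned topic word senses recovered."""
    return correct_extracted / manual_assigned

def f_measure(p, r):
    """Harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r)

# Hypothetical per-document counts: 3 correct out of 5 extracted,
# against 6 manually assigned topic word senses.
p, r = precision(3, 5), recall(3, 6)
print(round(f_measure(p, r), 3))  # -> 0.545
```

In the experiment these three values are computed per test document and then averaged over the whole test set.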
Figure 2 of the description gives the experimental results. The horizontal axis is the total number of topic word senses extracted by the method of the invention, ranging from 1 to 20, and the vertical axis is the average number of correctly extracted topic word senses. As can be seen from the figure, when the total number of extracted topic word senses is 5, the number of correct topic word senses is about 3, reaching an accuracy of about 60%; when the total is 9, about 4 are correct, reaching an accuracy of about 80%; when the total is 15, about 4.5 are correct, reaching an accuracy of 90%. This analysis shows that the topic word sense extraction method of the invention performs well.
The top five ranked topic word senses were selected in turn from the topic word sense set extracted for each text document; then the Precision, Recall and F-measure of each text document were calculated using the evaluation formulas (17), (18) and (19); finally, the averages over all text documents were calculated, and the final results are shown in Table 1.
TABLE 1 Performance of the topic word sense extraction algorithm

Topic word sense extraction algorithm | Precision | Recall | F-measure
Top 5 topic word senses               | 0.595     | 0.612  | 0.603
The evaluation experiments show that the topic word sense extraction method of the invention performs well, with high precision and recall, and can be applied to the automatic extraction of topic word senses from text documents. This is mainly because the invention processes word senses instead of words, and can therefore capture the topic meaning of a text document more accurately. As can be seen from Figure 2, when the total number of topic word senses extracted by the algorithm reaches 9, the accuracy reaches 80%, so the method can also be applied to semi-automatic topic labeling of text documents: a number of topic word senses are generated by the method and then screened by the user.