Disclosure of Invention
Technical problem to be solved
In order to solve the problem that existing topic meaning extraction algorithms are inaccurate because of word ambiguity, the invention provides a method that extracts topic meaning using word senses instead of words, which improves the precision of topic meaning extraction.
Technical scheme
The basic idea of the invention is: convert the candidate subject words in a text document into candidate topic word senses, then extract the candidate topic word senses, and finally output the topic word senses. Throughout the process, word senses rather than words are used to extract the topic meaning, so the inaccuracy caused by word ambiguity is avoided both in expressing the topic meaning and in the algorithm's processing.
The invention is technically characterized in that: the candidate subject word is converted into a candidate topic word sense for subsequent processing by applying a disambiguation technique to the context information of the candidate subject word; and in the process of extracting the topic meaning, statistical information and semantic information are considered simultaneously, so the precision of topic meaning extraction is improved.
A method for automatically extracting the theme word meaning of a text document is characterized by comprising the following steps:
(1) respectively preprocessing each text document in the training text document set and the test text document set to obtain a candidate topic word meaning set of each text document;
the preprocessing comprises the following steps:
step a: extracting a candidate subject term set of the text document:
firstly, removing the numbers and punctuation marks in the text document and segmenting the text document into a set of words;
then, removing the words in the set that do not meet the conditions below;
finally, converting the capital letters in the remaining words into lower-case letters and removing the prefixes and suffixes of the words, to obtain the candidate subject word set of the text document;
the conditions are as follows: the number of letters forming the word is less than a preset value, the word contains at least one lower-case letter, and the word is a non-stop word; the non-stop words are all words other than stop words, and the stop words are function words; the preset value is 15 letters;
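The filtering rules above (length limit, lower-case requirement, stop-word removal, lower-casing and suffix stripping) can be sketched as follows; the stop-word list and the trailing-"s" stripping are simplified stand-ins for a real stop list and stemmer, not part of the method as claimed:

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "on", "and", "or", "it", "is"}  # illustrative subset
MAX_LEN = 15  # the preset value named in the method

def candidate_subject_words(text):
    """Return the candidate subject word set of a text document."""
    # Remove digits and punctuation, then segment into words.
    words = re.sub(r"[^A-Za-z\s]", " ", text).split()
    kept = []
    for w in words:
        if len(w) > MAX_LEN:         # too many letters
            continue
        if w.isupper():              # no lower-case letter (e.g. acronyms)
            continue
        if w.lower() in STOP_WORDS:  # stop word (function word)
            continue
        kept.append(w.lower())       # convert capitals to lower case
    # Naive suffix stripping stands in for the stemming/root-form step
    # (a real system would use e.g. a Porter stemmer).
    return {w[:-1] if w.endswith("s") else w for w in kept}

print(candidate_subject_words("The NASA satellites orbit Earth in 2024."))
```

With this toy input, "The"/"in" are dropped as stop words, "NASA" is dropped for having no lower-case letter, and "satellites" is reduced to its root form.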
step b: adopting a disambiguation algorithm to obtain a candidate subject word sense set of the text document:
firstly, for each candidate subject word in the candidate subject word set, selecting the words within a range of W around it as the context of the candidate subject word; the value of W ranges over [6, 10];
then, the semantic relatedness rel(s_k, c_i) between the kth possible word sense s_k of each candidate subject word and the ith context c_i of the candidate subject word is computed according to the semantic relatedness formula

rel(s_k, c_i) = 2 × NumOfOverlaps_{s_k c_i} / (wordNumInGlossOf_{s_k} + wordNumInGlossOf_{c_i})

and the total semantic relatedness SenseScore(s_k) of the kth possible word sense s_k of the candidate subject word with all contexts of the candidate subject word is computed as

SenseScore(s_k) = Σ_{i=1}^{I} rel(s_k, c_i)

wherein k = 1, 2, …, K, and K is the number of possible word senses of the candidate subject word; i = 1, 2, …, I, and I is the number of contexts of the candidate subject word; wordNumInGlossOf_{s_k} denotes the number of words contained in the WordNet gloss of s_k, wordNumInGlossOf_{c_i} denotes the number of words contained in the WordNet gloss of c_i, and NumOfOverlaps_{s_k c_i} denotes the number of identical words between the words contained in the WordNet gloss of s_k and those contained in the WordNet gloss of c_i; the possible word senses are those defined in the lexical database WordNet;
finally, selecting the possible word sense with the maximum total semantic relevance SenseScore value as a candidate subject word sense of the candidate subject word to obtain a candidate subject word sense set of the text document;
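The disambiguation step can be illustrated with a small sketch. The text names the gloss-length and overlap-count variables but the exact relatedness formula is garbled here, so a Dice-style normalized gloss overlap (2 × overlaps / sum of gloss word counts, bounded in [0, 1]) is assumed; the glosses below are toy examples, not real WordNet entries:

```python
def rel(gloss_a, gloss_b):
    """Assumed Dice-style normalized gloss overlap between two glosses."""
    words_a, words_b = gloss_a.split(), gloss_b.split()
    overlaps = len(set(words_a) & set(words_b))
    return 2.0 * overlaps / (len(words_a) + len(words_b))

def best_sense(candidate_senses, context_glosses):
    """Pick the sense with maximal SenseScore, i.e. the sum of rel
    values with every context gloss.  candidate_senses: {sense: gloss}."""
    def sense_score(gloss):
        return sum(rel(gloss, cg) for cg in context_glosses)
    return max(candidate_senses, key=lambda s: sense_score(candidate_senses[s]))

# Toy glosses (hypothetical, for illustration only):
senses = {
    "bank#1": "sloping land beside a body of water",
    "bank#2": "financial institution that accepts deposits",
}
context = ["body of water that flows", "land along a river"]
print(best_sense(senses, context))  # -> bank#1: its gloss overlaps the context
```

Here "bank#1" wins because its gloss shares "body", "of", "water" and "land" with the context glosses, while "bank#2" shares almost nothing.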
step c: merging candidate topic senses:
the semantic relatedness rel(cs_p, cs_q) between any two candidate topic word senses cs_p and cs_q in the candidate topic word sense set is computed according to the semantic relatedness formula

rel(cs_p, cs_q) = 2 × NumOfOverlaps_{cs_p cs_q} / (wordNumInGlossOf_{cs_p} + wordNumInGlossOf_{cs_q})

and either one of any two candidate topic word senses whose semantic relatedness value is larger than a given threshold λ is removed; the value of the threshold λ ranges over [0.5, 0.8];
wherein p, q = 1, 2, …, P, p ≠ q, and P is the number of candidate topic word senses in the candidate topic word sense set; wordNumInGlossOf_{cs_p} denotes the number of words contained in the WordNet gloss of cs_p, wordNumInGlossOf_{cs_q} denotes the number of words contained in the WordNet gloss of cs_q, and NumOfOverlaps_{cs_p cs_q} denotes the number of identical words between the words contained in the WordNet glosses of cs_p and cs_q;
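A minimal sketch of the merging step, under the same assumed Dice-style gloss-overlap relatedness; λ = 0.65 is an illustrative value from the stated range [0.5, 0.8], and the glosses are toy data:

```python
def rel(gloss_a, gloss_b):
    # Assumed Dice-style gloss overlap, as in the disambiguation step.
    a, b = gloss_a.split(), gloss_b.split()
    return 2.0 * len(set(a) & set(b)) / (len(a) + len(b))

def merge_senses(sense_ids, glosses, lam=0.65):
    """Greedy pass: keep a candidate topic word sense only if it is not
    too close (rel > lambda) to any sense already kept."""
    kept = []
    for s in sense_ids:
        if all(rel(glosses[s], glosses[k]) <= lam for k in kept):
            kept.append(s)
    return kept

glosses = {
    "car#1": "a motor vehicle with four wheels",
    "auto#1": "a motor vehicle with four wheels",  # duplicate meaning
    "road#1": "an open way for travel or transport",
}
print(merge_senses(list(glosses), glosses))  # -> ['car#1', 'road#1']
```

The duplicate sense "auto#1" has relatedness 1.0 to "car#1", exceeding λ, so one of the pair is removed, exactly as the merging rule prescribes.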
(2) calculating the characteristic attribute value of each candidate topic word sense in the candidate topic word sense set in the text document; the characteristic attributes comprise: the frequency tf x idf of occurrence of the candidate topic word sense in the text document, the average position fo of the first occurrence of the candidate topic word sense in the text document, the number of letters len contained in the candidate topic word sense, and the cohesiveness coh between the candidate topic word senses;
the calculation formula of the frequency tf multiplied by idf of the candidate topic word senses appearing in the text document is as follows:
tf×idf(cs_j) = f(cs_j) × log( |D| / |D(cs_j)| )
wherein cs_j is the jth candidate topic word sense in the candidate topic word sense set of the text document, j = 1, 2, …, J, and J is the number of candidate topic word senses in the candidate topic word sense set of the text document; f(cs_j) is the number of occurrences of cs_j in the text document; D denotes a set of text documents; |D| is the number of text documents in D; |D(cs_j)| is the number of text documents in D that contain the candidate topic word sense cs_j;
the calculation formula of the average position fo of the candidate topic word senses appearing for the first time in the text document is as follows:
fo(cs_j) = O_first / J
wherein O_first is the position of the first occurrence of the candidate topic word sense cs_j in the text document;
the calculation formula of the cohesiveness coh among the candidate topic word senses is as follows:
coh(cs_j) = ( Σ_{l=1, l≠j}^{J} rel(cs_j, cs_l) ) / (J − 1)
wherein rel(cs_j, cs_l) is the semantic relatedness between the candidate topic word senses cs_j and cs_l in the candidate topic word sense set of the text document, computed according to the semantic relatedness formula; wordNumInGlossOf_{cs_j} denotes the number of words contained in the WordNet gloss of cs_j, wordNumInGlossOf_{cs_l} denotes the number of words contained in the WordNet gloss of cs_l, and NumOfOverlaps_{cs_j cs_l} denotes the number of identical words between the words contained in the WordNet glosses of cs_j and cs_l;
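The feature attributes of step (2) can be sketched as follows, treating a document as the list of its candidate topic word senses. The `rel` function is passed in as a parameter because its exact form is given by the semantic relatedness formula; the toy corpus and the constant placeholder relatedness are illustrative only:

```python
import math

def tf_idf(sense, doc, corpus):
    """tf x idf of a candidate topic word sense: occurrence count times
    log of (corpus size / number of documents containing the sense)."""
    f = doc.count(sense)
    containing = sum(1 for d in corpus if sense in d)
    return f * math.log(len(corpus) / containing)

def first_occurrence(sense, doc):
    """fo: position of the first occurrence divided by the number of
    candidate topic word senses in the document."""
    return doc.index(sense) / len(doc)

def cohesiveness(sense, doc, rel):
    """coh: mean relatedness of the sense to every other candidate
    topic word sense in the document."""
    others = [s for s in doc if s != sense]
    return sum(rel(sense, s) for s in others) / (len(doc) - 1)

corpus = [
    ["sense_a", "sense_b", "sense_c"],
    ["sense_b", "sense_c"],
    ["sense_c"],
]
doc = corpus[0]
toy_rel = lambda a, b: 0.5  # placeholder relatedness for illustration
print(round(tf_idf("sense_a", doc, corpus), 3))  # f = 1, log(3/1)
print(first_occurrence("sense_b", doc))
print(cohesiveness("sense_a", doc, toy_rel))
```

The fourth feature, len, is simply the number of letters in the candidate topic word sense (`len("sense_a")` in Python terms) and needs no helper.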
(3) extracting a final theme word meaning set of each text document in the test text document set by using a Bayesian model:
firstly, calculating the probability Pr that each candidate topic word sense of each text document in the test text document set is a topic word sense, according to Pr = Pr[T|yes] × Pr[O|yes] × Pr[L|yes] × Pr[C|yes] × Pr[yes];
wherein Pr[T|yes], Pr[O|yes], Pr[L|yes] and Pr[C|yes] respectively represent the probability that a candidate topic word sense is a topic word sense conditioned on its tf×idf, fo, len and coh feature attribute values, and Pr[yes] represents the ratio of the number of text documents in the training text document set for which the candidate topic word sense is a topic word sense to the number of text documents in the training text document set for which it is not;
then, sequencing all candidate subject word senses of the text document from large to small according to the probability Pr value;
finally, the user-set number of candidate topic word senses ranked first are selected to form the final topic word sense set for the text document.
The calculation formulas of Pr [ T | yes ], Pr [ O | yes ], Pr [ L | yes ], Pr [ C | yes ] and Pr [ yes ] are respectively:
Pr[T|yes] = \overline{tf×idf}^1(cs'_m) / tf×idf^{d'}(cs'_m)
Pr[O|yes] = \overline{fo}^1(cs'_m) / fo^{d'}(cs'_m)
Pr[L|yes] = \overline{len}^1(cs'_m) / len^{d'}(cs'_m)
Pr[C|yes] = \overline{coh}^1(cs'_m) / coh^{d'}(cs'_m)
wherein d' is a text document in the test text document set; cs'_m is the mth candidate topic word sense of the text document d', m = 1, 2, …, M, and M is the number of candidate topic word senses in the candidate topic word sense set of d'; tf×idf^{d'}(cs'_m), fo^{d'}(cs'_m), len^{d'}(cs'_m) and coh^{d'}(cs'_m) are respectively the tf×idf, fo, len and coh feature attribute values of the candidate topic word sense cs'_m in the text document d'; \overline{tf×idf}^1(cs'_m), \overline{fo}^1(cs'_m), \overline{len}^1(cs'_m) and \overline{coh}^1(cs'_m) are respectively the average tf×idf, fo, len and coh feature attribute values of cs'_m in the text document set T^1; the text document set T^1 is the set of text documents in the training text document set for which the candidate topic word sense cs'_m is a topic word sense; the text document set T^0 is the set of text documents in the training text document set for which the candidate topic word sense cs'_m is not a topic word sense;
the calculation formulas of \overline{tf×idf}^1(cs'_m), \overline{fo}^1(cs'_m), \overline{len}^1(cs'_m) and \overline{coh}^1(cs'_m) are respectively:
\overline{tf×idf}^1(cs'_m) = ( Σ_{n=1}^{|T^1|} tf×idf_n^1(cs'_m) ) / |T^1|
\overline{fo}^1(cs'_m) = ( Σ_{n=1}^{|T^1|} fo_n^1(cs'_m) ) / |T^1|
\overline{len}^1(cs'_m) = ( Σ_{n=1}^{|T^1|} len_n^1(cs'_m) ) / |T^1|
\overline{coh}^1(cs'_m) = ( Σ_{n=1}^{|T^1|} coh_n^1(cs'_m) ) / |T^1|
wherein tf×idf_n^1(cs'_m), fo_n^1(cs'_m), len_n^1(cs'_m) and coh_n^1(cs'_m) are respectively the tf×idf, fo, len and coh feature attribute values of the candidate topic word sense cs'_m in the nth text document of the text document set T^1.
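The Bayesian scoring and top-N selection can be sketched as below. Each conditional probability is modeled as a ratio between the candidate's feature attribute value in the document and its average over the T^1 training documents; the dictionary keys and the numeric values are illustrative assumptions, not data from the method:

```python
def bayes_score(feats, avg_T1, pr_yes):
    """Pr = Pr[T|yes] x Pr[O|yes] x Pr[L|yes] x Pr[C|yes] x Pr[yes],
    with each conditional taken as the document feature value divided
    by its average over the T^1 documents (ratio-based estimate)."""
    pr = pr_yes
    for name in ("tfidf", "fo", "len", "coh"):
        pr *= feats[name] / avg_T1[name]
    return pr

def top_n(candidates, n):
    """candidates: {sense: Pr}; return the n senses with largest Pr,
    sorted from large to small as the method prescribes."""
    return sorted(candidates, key=candidates.get, reverse=True)[:n]

# Illustrative feature values for one candidate topic word sense:
feats = {"tfidf": 2.0, "fo": 0.5, "len": 6, "coh": 0.4}
avg = {"tfidf": 1.0, "fo": 0.25, "len": 6, "coh": 0.2}
print(bayes_score(feats, avg, pr_yes=0.3))
print(top_n({"a": 0.3, "b": 0.1, "c": 0.5}, 2))  # -> ['c', 'a']
```

Because every candidate is scored with the same product form, the absolute value of Pr matters less than the ordering it induces, which is why only the top-ranked candidates are kept.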
Beneficial effects:
the invention provides an automatic extraction method of text document theme word senses, which uses word senses to replace words for processing, and solves the problems of inaccurate expression of theme meaning and misoperation in the extraction process caused by word ambiguity, thereby improving the accuracy of an algorithm. In addition, in the extraction process, the invention considers statistical information (Bayesian estimation probability) and semantic information (word meaning) at the same time, thereby further improving the accuracy of the algorithm.
Detailed Description
Given a training text document set T = {t_1, …, t_|T|} and a set of text documents to be extracted (the test text document set) E = {e_1, …, e_|E|}, each text document in T and E is processed according to steps one and two below, specifically:
Step one: preprocess the text document. For a text document t_i in T (i = 1, …, |T|, where |T| is the number of text documents in the text document set T), first use step 1.1 to obtain the candidate subject words of the text document, then use step 1.2 to obtain the candidate topic word senses, and finally use step 1.3 to merge the candidate topic word senses, obtaining the final candidate topic word sense set of the text document t_i.
Step 1.1: and acquiring candidate subject terms. First, the text document t is removediThe numbers and various punctuation marks in (a), represent the text document as a collection of words: t is ti={w1,…,wij… }; then, for each word w in the set of wordsijThe invention adopts the following rules to judge whether the candidate subject term is a candidate subject term: if the composition wijIs greater than a predetermined value L (here, L ═ 15), or constitutes wijAll the letters of (A) are capital letters, or wijFor stop words (i.e. imaginary words including articles, pronouns, etc.), then wijCan not be candidate subject words, willIt is from the set w1,…,wij… } is removed; finally, set { w1,…,wij…, and removing the suffix from the words, i.e. each candidate subject word is represented in root form, to obtain the text document tiIs selected from the candidate subject word set CWi={cw1,…,cwij,…}。
Step 1.2: candidate topic word senses are obtained. For text documents tiIs selected from the candidate subject word set CWi={cw1,…,cwij… } of the candidate subject word cwij(j=1,…,|CWi|,|CWiI is candidate subject word set CWiThe number of candidate subject terms in the text document t) is obtained by the disambiguation algorithm in the inventioniThe correct sense in (1).
First, in CW_i, all words within the distance range W of cw_ij are selected as its context, giving the context set C_ij = {c_ij1, …, c_ijl, …} of cw_ij (|C_ij| is the number of words in the context set C_ij); S_ij = {s_ij1, …, s_ijk, …} is the set of possible word senses of the candidate subject word cw_ij, where |S_ij| is the number of possible word senses in the set S_ij, and a possible word sense is a word sense of the candidate subject word defined in the lexical database WordNet. Then, the semantic relatedness rel(s_ijk, c_ijl) between the kth possible word sense s_ijk of the candidate subject word cw_ij and its lth context word c_ijl is calculated as follows:

rel(s_ijk, c_ijl) = 2 × NumOfOverlaps_{s_ijk c_ijl} / (wordNumInGlossOf_{s_ijk} + wordNumInGlossOf_{c_ijl})    (1)
wherein wordNumInGlossOf_{s_ijk} denotes the number of words contained in the WordNet gloss of s_ijk, wordNumInGlossOf_{c_ijl} denotes the number of words contained in the WordNet gloss of c_ijl, and NumOfOverlaps_{s_ijk c_ijl} denotes the number of identical words between the words contained in the WordNet glosses of s_ijk and c_ijl;
then, the total semantic relatedness SenseScore(s_ijk) of each possible word sense s_ijk with all the context words c_ijl (l = 1, …, |C_ij|) in the context set is obtained according to the following formula:
SenseScore(s_ijk) = Σ_{l=1}^{|C_ij|} rel(s_ijk, c_ijl)    (2)
Finally, the possible word sense with the largest total semantic relatedness SenseScore value is selected as the correct word sense of the candidate subject word cw_ij, i.e., the candidate topic word sense of cw_ij.
The above method is used to compute the candidate topic word sense of every candidate subject word cw_ij (j = 1, 2, …, |CW_i|) in the candidate subject word set CW_i = {cw_1, …, cw_ij, …} of the text document t_i, forming the candidate topic word sense set of the text document t_i, recorded as CS_i = {cs_i1, …, cs_im, …}, where |CS_i| is the number of candidate topic word senses in the set CS_i.
Step 1.3: merging candidate topic senses. For a set of candidate topic word senses CSiAny two candidate topic senses cs inipAnd csiq(p,q=1,2,…,|CSiL, p ≠ q), and the semantic relevance rel (cs) between the l, p ≠ q is calculated according to the formula (3)ip csiq) If rel (cs)ip,csiq) Lambda (lambda is a given threshold), the corresponding candidate subject word cw is consideredipAnd cwiqAre semantically identical, will csipAnd csiqAs identical candidate topic word senses, i.e. in the set of candidate topic word senses CSiMiddle deletion csipOr csiq。
wherein wordNumInGlossOf_{cs_ip} denotes the number of words contained in the WordNet gloss of cs_ip, wordNumInGlossOf_{cs_iq} denotes the number of words contained in the WordNet gloss of cs_iq, and NumOfOverlaps_{cs_ip cs_iq} denotes the number of identical words between the words contained in the WordNet glosses of cs_ip and cs_iq.
Step two: and calculating the characteristic attribute. For the text document t obtained in the step oneiOf a set of candidate topic word senses CSiEach candidate topic sense cs inim(m=1,2,…,|CSi|,|CSiL is CSiThe number of the candidate topic word senses) respectively calculate four characteristic attribute values thereof, namely the frequency tf x idf of the candidate topic word sense appearing in the text document, the average position fo of the candidate topic word sense appearing in the text document for the first time, the cohesion coh between the letter number len contained in the candidate topic word sense and the candidate topic word sense, and the candidate topic word sense csimThe specific calculation formula of the tf × idf, fo and coh attribute values is as follows:
tf×idf(cs_im) = f(cs_im) × log( |T| / |T(cs_im)| )    (4)

fo(cs_im) = O_first / |CS_i|    (5)

coh(cs_im) = ( Σ_{p=1, p≠m}^{|CS_i|} rel(cs_im, cs_ip) ) / (|CS_i| − 1)    (6)
wherein f(cs_im) is the number of occurrences of the candidate topic word sense cs_im in the text document t_i; |T| is the number of text documents in the text document set T; |T(cs_im)| is the number of text documents in T that contain the candidate topic word sense cs_im; O_first is the position of the first occurrence of the candidate topic word sense cs_im in the text document t_i; rel(cs_im, cs_ip) is the semantic relatedness between candidate topic word senses calculated by formula (3).
Each text document in the text document set to be extracted E = {e_1, …, e_i, …, e_|E|} (i.e., the test text document set) is likewise processed by steps one and two above, which for each text document e_i in E yields its candidate topic word sense set {ce_i1, …, ce_ij, …} and the four feature attribute values of each candidate topic word sense ce_ij therein: tf×idf(ce_ij), fo(ce_ij), len(ce_ij) and coh(ce_ij). The Bayesian estimation method of step three below is then used to extract the topic word senses of the test text document set. Specifically:
Step three: extract the topic word senses. Since the topic word senses of the training text document set are known, for each candidate topic word sense ce_ij of a text document e_i in the test text document set E, the training text document set T is first divided into two classes according to whether ce_ij is a topic word sense of each training text document: for a text document t_i in the training text document set T, if the candidate topic word sense ce_ij is a topic word sense of t_i, the text document t_i falls into the first class of text documents, the set T^1; if the candidate topic word sense ce_ij is not a topic word sense of t_i, the text document t_i falls into the second class, the set T^0. Then, the average attribute values of ce_ij over the set T^1 are calculated by the following formulas:
\overline{tf×idf}^1(ce_ij) = ( Σ_{u=1}^{|T^1|} tf×idf_u^1(ce_ij) ) / |T^1|    (7)

\overline{fo}^1(ce_ij) = ( Σ_{u=1}^{|T^1|} fo_u^1(ce_ij) ) / |T^1|    (8)

\overline{len}^1(ce_ij) = ( Σ_{u=1}^{|T^1|} len_u^1(ce_ij) ) / |T^1|    (9)

\overline{coh}^1(ce_ij) = ( Σ_{u=1}^{|T^1|} coh_u^1(ce_ij) ) / |T^1|    (10)
wherein tf×idf_u^1(ce_ij), fo_u^1(ce_ij), len_u^1(ce_ij) and coh_u^1(ce_ij) are respectively the tf×idf, fo, len and coh attribute values of ce_ij in the uth text document t_u of the set T^1;
finally, the probability Pr that the candidate topic word sense ce_ij is a final topic word sense of the text document e_i is calculated according to the following formula:
Pr=Pr[T|yes]×Pr[O|yes]×Pr[L|yes]×Pr[C|yes]×Pr[yes] (11)
wherein Pr[T|yes], Pr[O|yes], Pr[L|yes] and Pr[C|yes] respectively represent the Bayesian estimation probability that the candidate topic word sense ce_ij of the text document e_i in the test text document set E is a topic word sense conditioned on its current feature attribute values tf×idf, fo, len and coh, and Pr[yes] represents the ratio of the number of text documents in the training text document set for which the candidate topic word sense is a topic word sense to the number of text documents in the training text document set for which it is not;
the calculation formulas of Pr [ T | yes ], Pr [ O | yes ], Pr [ L | yes ], Pr [ C | yes ] and Pr [ yes ] are respectively:
Pr[T|yes] = tf×idf^{e_i}(ce_ij) / \overline{tf×idf}^1(ce_ij)    (12)

Pr[O|yes] = fo^{e_i}(ce_ij) / \overline{fo}^1(ce_ij)    (13)

Pr[L|yes] = len^{e_i}(ce_ij) / \overline{len}^1(ce_ij)    (14)

Pr[C|yes] = coh^{e_i}(ce_ij) / \overline{coh}^1(ce_ij)    (15)

Pr[yes] = |T^1| / |T^0|    (16)
wherein tf×idf^{e_i}(ce_ij), fo^{e_i}(ce_ij), len^{e_i}(ce_ij) and coh^{e_i}(ce_ij) are respectively the tf×idf, fo, len and coh attribute values of ce_ij in the text document e_i of the test text document set E; |T^1| and |T^0| are respectively the numbers of text documents contained in the sets T^1 and T^0.
Using the above method, the probability Pr of becoming a final topic word sense is calculated for all candidate topic word senses in the candidate topic word sense set of each text document e_i in the text document set to be extracted (i.e., the test document set); the candidate topic word senses are sorted by Pr value from largest to smallest, and the top N candidate topic word senses are taken as required as the extracted topic word senses of the text document e_i.
Example experiment: the invention was implemented as a Java program, and a set of experiments was then performed to evaluate it; in these experiments the threshold λ was set to 0.9. The experimental data are 500 text documents containing subject words, randomly downloaded from an online text document database maintained by the UN Food and Agriculture Organization. These text documents contain 4.95 subject words on average. 300 text documents were used to train the model, and the other 200 text documents were used for testing.
Precision, Recall and the combined F-measure are used to evaluate the topic word sense extraction algorithm:

Precision = correct_extracted_keywords / all_extracted_keywords    (17)

Recall = correct_extracted_keywords / manual_assigned_keywords    (18)

F-measure = 2 × Precision × Recall / (Precision + Recall)    (19)

wherein correct_extracted_keywords is the number of correctly extracted topic word senses, all_extracted_keywords is the number of all extracted topic word senses, and manual_assigned_keywords is the number of manually assigned topic word senses.
Equations (17), (18) and (19) are used to evaluate each text document, and the final Precision, Recall and F-measure are the averages over the entire test text document set.
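The evaluation metrics can be computed directly; the counts below are illustrative stand-ins for one document's results, not data from the experiment:

```python
def precision(correct_extracted, all_extracted):
    """Fraction of extracted topic word senses that are correct."""
    return correct_extracted / all_extracted

def recall(correct_extracted, manual_assigned):
    """Fraction of manually assigned topic word senses recovered."""
    return correct_extracted / manual_assigned

def f_measure(p, r):
    """Harmonic mean of Precision and Recall."""
    return 2 * p * r / (p + r)

# Hypothetical per-document counts: 3 correct out of 5 extracted,
# against 6 manually assigned topic word senses.
p, r = precision(3, 5), recall(3, 6)
print(round(f_measure(p, r), 3))  # -> 0.545
```

In the experiment these three values are computed per test document and then averaged over the whole test set.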
Figure 2 of the description gives the experimental results. The horizontal axis is the total number of topic word senses extracted by the method of the invention, ranging from 1 to 20, and the vertical axis is the average number of correctly extracted topic word senses. As can be seen from the figure, when the total number of extracted topic word senses is 5, the number of correct topic word senses is about 3, reaching an accuracy of about 60%; when the total is 9, about 4 are correct, reaching an accuracy of about 80%; when the total is 15, about 4.5 are correct, reaching an accuracy of 90%. This analysis shows that the topic word sense extraction method of the invention performs well.
The top five ranked topic word senses were selected in turn from the topic word sense set extracted for each text document; then the Precision, Recall and F-measure of each text document were calculated using the evaluation formulas (17), (18) and (19); finally, the averages over all text documents were calculated, and the final results are shown in Table 1.
TABLE 1 Performance of the topic word sense extraction algorithm

Topic word sense extraction algorithm | Precision | Recall | F-measure
Top 5 topic word senses               | 0.595     | 0.612  | 0.603
The evaluation experiments show that the topic word sense extraction method of the invention performs well, with high precision and recall, and can be applied to the automatic extraction of topic word senses from text documents. This is mainly because the invention processes word senses instead of words, and can therefore capture the topic meaning of a text document more accurately. As can be seen from Figure 2, when the total number of topic word senses extracted by the algorithm reaches 9, the accuracy reaches 80%, so the method can also be applied to semi-automatic topic labeling of text documents: a number of topic word senses are generated by the method and then screened by the user.