CN101887415A - Automatic extraction method for text document theme word meaning - Google Patents

Automatic extraction method for text document theme word meaning

Info

Publication number
CN101887415A
CN101887415A (Application CN 201010210106 / CN201010210106A)
Authority
CN
China
Prior art keywords
word
text document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 201010210106
Other languages
Chinese (zh)
Other versions
CN101887415B (en)
Inventor
Fang Jun
Guo Lei
Chang Weiwei
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
COMTEC SOLAR (JIANGSU) Co Ltd
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University filed Critical Northwestern Polytechnical University
Priority to CN2010102101066A priority Critical patent/CN101887415B/en
Publication of CN101887415A publication Critical patent/CN101887415A/en
Application granted granted Critical
Publication of CN101887415B publication Critical patent/CN101887415B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method for automatically extracting the topic word senses of text documents, which comprises the following steps: first, preprocessing each text document in a training text document set and a test text document set to obtain the candidate topic word sense set of each text document; then, calculating the feature attribute values of each candidate topic word sense; and finally, extracting the final topic word senses of each text document in the test text document set using a Bayesian model. Using word senses in place of words throughout the extraction process avoids the inaccuracy caused by polysemy, so the method improves the precision of topic meaning extraction.

Description

Automatic extraction method of text document theme word senses
Technical Field
The invention relates to a method for automatically extracting the topic word senses of text documents, and belongs to the fields of computer information processing and natural language processing. The method is suitable for quickly and accurately extracting the topics of large numbers of text documents.
Background
With the development of the Internet, the total amount of information grows exponentially, and a large amount of it is presented to people in the form of electronic text documents; automatic tools are therefore urgently needed to help people quickly find the information they really need from this mass of information. To achieve this goal, the primary task is to extract the topic meaning of a text document. The topic meaning can also be applied in many other text mining fields, such as text classification, text clustering and text retrieval. Ideally, the topic meaning would be assigned manually, but because of the huge number of text documents, manual assignment is infeasible, so research on high-performance automatic topic meaning extraction algorithms is very important.
The topic meaning of a text document is its summary information, and the task of subject word extraction is to find words in the text document that can describe its content. Current research therefore uses subject words to represent the semantic information of text document resources, converting the problem of topic meaning extraction into the problem of subject word extraction.
Existing research methods use subject words to represent the topic meaning of a text document. However, the gap between the vocabulary level (the words expressing meanings) and the concept level (the meanings themselves) means that the same word has different senses in different contexts and that different words can express the same meaning. This causes inaccuracy in topic meaning extraction, mainly in the following two aspects:
● Inaccurate expression of the topic meaning. Because words have multiple senses, a topic meaning expressed by a word may be ambiguous; for example, "mouse" can denote either the animal or the computer pointing device, so confusion arises when "mouse" is given to express the topic meaning of a text document;
● Inaccuracy in the topic meaning extraction process. Existing methods perform various operations on words during extraction, including counting their occurrence frequencies, initial positions and so on. If word senses are not considered in these operations, some of them will be erroneous, reducing the accuracy of topic meaning extraction.
To solve the above problems, the present invention uses word senses instead of words, because a word sense has a unique meaning. In the topic word sense extraction algorithm, the word senses of candidate subject words are obtained by a disambiguation algorithm, and the accuracy of the algorithm is then improved by taking the relatedness between word senses into account in the word sense merging and extraction steps.
Disclosure of Invention
Technical problem to be solved
To solve the inaccuracy of existing topic meaning extraction algorithms caused by word ambiguity, the invention provides a method that extracts the topic meaning using word senses instead of words, which can improve the precision of topic meaning extraction.
Technical scheme
The basic idea of the invention is: convert the candidate subject words in a text document into candidate topic word senses, then perform extraction over the candidate topic word senses, and finally output the topic word senses. Throughout the process, word senses rather than words are used to extract the topic meaning, so inaccuracy caused by word ambiguity is avoided both in the expression of the topic meaning and in the processing of the algorithm.
The technical features of the invention are: converting candidate subject words into candidate topic word senses for subsequent processing by applying disambiguation technology to the context information of the candidate subject words; and considering statistical information and semantic information simultaneously during topic meaning extraction, thereby improving the precision of topic meaning extraction.
A method for automatically extracting the theme word meaning of a text document is characterized by comprising the following steps:
(1) respectively preprocessing each text document in the training text document set and the test text document set to obtain a candidate topic word meaning set of each text document;
the pretreatment comprises the following steps:
step a: extracting a candidate subject term set of the text document:
firstly, removing the numbers and punctuation marks in the text document, and segmenting the text document into a set of words;
then, removing words which do not meet the conditions in the set;
finally, converting capital letters in the remaining words into lowercase letters, and removing prefixes and suffixes of the words to obtain a candidate subject word set of the text document;
the conditions are as follows: the number of letters forming the word is less than a preset value, the word contains at least one lowercase letter, and the word is a non-stop word; non-stop words are all words other than stop words, and stop words are function words; the preset value is 15 letters;
step b: adopting a disambiguation algorithm to obtain a candidate subject word sense set of the text document:
firstly, selecting the words within a distance W of each candidate subject word in the candidate subject word set as the context of that candidate subject word; the value range of W is [6, 10];
then, according to the semantic relatedness formula

$$\mathrm{rel}(s_k, c_i) = \frac{\mathrm{NumOfOverlaps}_{s_k c_i}}{(\mathrm{wordNumInGlossOf}s_k + \mathrm{wordNumInGlossOf}c_i)/2}$$

computing the semantic relatedness rel(s_k, c_i) between the k-th possible word sense s_k of each candidate subject word and the i-th context word c_i of that candidate subject word, and according to

$$\mathrm{SenseScore}(s_k) = \sum_{i=1}^{I} \mathrm{rel}(s_k, c_i)$$

computing the total semantic relatedness SenseScore(s_k) of the k-th possible word sense s_k to all contexts of the candidate subject word;

wherein k = 1, 2, …, K, and K is the number of possible word senses of the candidate subject word; i = 1, 2, …, I, and I is the number of context words of the candidate subject word; wordNumInGlossOfs_k denotes the number of words contained in the WordNet gloss of s_k, wordNumInGlossOfc_i denotes the number of words contained in the WordNet gloss of c_i, and NumOfOverlaps_s_k c_i denotes the number of identical words shared by the WordNet glosses of s_k and c_i; the possible word senses are those defined in the lexical database WordNet;
finally, selecting the possible word sense with the maximum total semantic relevance SenseScore value as a candidate subject word sense of the candidate subject word to obtain a candidate subject word sense set of the text document;
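As a worked illustration of the gloss-overlap measure above (the numbers are illustrative only): if the WordNet gloss of s_k contains 10 words, the gloss of c_i contains 8 words, and the two glosses share 3 words, then rel(s_k, c_i) = 3 / ((10 + 8)/2) = 3/9 ≈ 0.33.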
step c: merging candidate topic word senses:

according to the semantic relatedness formula

$$\mathrm{rel}(cs_p, cs_q) = \frac{\mathrm{NumOfOverlaps}_{cs_p cs_q}}{(\mathrm{wordNumInGlossOf}cs_p + \mathrm{wordNumInGlossOf}cs_q)/2}$$

computing the semantic relatedness of any two candidate topic word senses cs_p and cs_q in the candidate topic word sense set, and removing either one of any two candidate topic word senses whose semantic relatedness value is larger than a given threshold λ; the value range of the threshold λ is [0.5, 0.8];

wherein p, q = 1, 2, …, J, p ≠ q, and J is the number of candidate topic word senses in the candidate topic word sense set; wordNumInGlossOfcs_p denotes the number of words contained in the WordNet gloss of cs_p, wordNumInGlossOfcs_q denotes the number of words contained in the WordNet gloss of cs_q, and NumOfOverlaps_cs_p cs_q denotes the number of identical words shared by the WordNet glosses of cs_p and cs_q;
(2) calculating the feature attribute values of each candidate topic word sense in the candidate topic word sense set in the text document; the feature attributes comprise: the frequency tf×idf with which the candidate topic word sense occurs in the text document, the average position fo of the first occurrence of the candidate topic word sense in the text document, the number of letters len contained in the candidate topic word sense, and the cohesion coh between candidate topic word senses;
the calculation formula of the frequency tf multiplied by idf of the candidate topic word senses appearing in the text document is as follows:
$$tf \times idf(cs_j) = f(cs_j) \times \log \frac{|D|}{|D(cs_j)|}$$

wherein cs_j is the j-th candidate topic word sense in the candidate topic word sense set of the text document, j = 1, 2, …, J, and J is the number of candidate topic word senses in that set; f(cs_j) is the number of occurrences of cs_j in the text document; D denotes the text document set; |D| is the number of text documents in D; and |D(cs_j)| is the number of text documents in D that contain the candidate topic word sense cs_j;
the calculation formula of the average position fo of the candidate topic word senses appearing for the first time in the text document is as follows:
$$fo(cs_j) = O_{first} / J$$

wherein O_{first} is the position of the first occurrence of the candidate topic word sense cs_j in the text document;
the calculation formula of the cohesiveness coh among the candidate topic word senses is as follows:
$$coh(cs_j) = \frac{\sum_{l=1,\, l \neq j}^{J} \mathrm{rel}(cs_j, cs_l)}{J - 1}$$

wherein rel(cs_j, cs_l) is the semantic relatedness between the candidate topic word senses cs_j and cs_l in the candidate topic word sense set of the text document, computed according to the semantic relatedness formula

$$\mathrm{rel}(cs_j, cs_l) = \frac{\mathrm{NumOfOverlaps}_{cs_j cs_l}}{(\mathrm{wordNumInGlossOf}cs_j + \mathrm{wordNumInGlossOf}cs_l)/2}$$

where wordNumInGlossOfcs_j denotes the number of words contained in the WordNet gloss of cs_j, wordNumInGlossOfcs_l denotes the number of words contained in the WordNet gloss of cs_l, and NumOfOverlaps_cs_j cs_l denotes the number of identical words shared by the WordNet glosses of cs_j and cs_l;
(3) extracting a final theme word meaning set of each text document in the test text document set by using a Bayesian model:
firstly, calculating the probability Pr that each candidate topic word sense of each text document in the test text document set is a topic word sense, according to Pr = Pr[T|yes] × Pr[O|yes] × Pr[L|yes] × Pr[C|yes] × Pr[yes];

wherein Pr[T|yes], Pr[O|yes], Pr[L|yes] and Pr[C|yes] respectively denote the probability that a candidate topic word sense is a topic word sense given its tf×idf, fo, len and coh feature attribute values, and Pr[yes] denotes the ratio of the number of text documents in the training text document set for which the candidate topic word sense is a topic word sense to the number of text documents in the training text document set for which it is not;
then, sorting all candidate topic word senses of the text document from large to small according to their probability Pr values;
finally, selecting the top-ranked, user-set number of candidate topic word senses to form the final topic word sense set of the text document.
The calculation formulas of Pr[T|yes], Pr[O|yes], Pr[L|yes], Pr[C|yes] and Pr[yes] are respectively:

$$\Pr[T|yes] = \overline{tf \times idf}^{\,1}(cs'_m) \,/\, tf \times idf^{\,d'}(cs'_m)$$

$$\Pr[O|yes] = \overline{fo}^{\,1}(cs'_m) \,/\, fo^{\,d'}(cs'_m)$$

$$\Pr[L|yes] = \overline{len}^{\,1}(cs'_m) \,/\, len^{\,d'}(cs'_m)$$

$$\Pr[C|yes] = \overline{coh}^{\,1}(cs'_m) \,/\, coh^{\,d'}(cs'_m)$$

$$\Pr[yes] = \frac{|T^1|}{|T^0|}$$

wherein d' is a text document in the test text document set; cs'_m is the m-th candidate topic word sense of the text document d', m = 1, 2, …, M, and M is the number of candidate topic word senses in the candidate topic word sense set of d'; tf×idf^{d'}(cs'_m), fo^{d'}(cs'_m), len^{d'}(cs'_m) and coh^{d'}(cs'_m) are respectively the tf×idf, fo, len and coh feature attribute values of the candidate topic word sense cs'_m in the text document d'; \overline{tf×idf}^1(cs'_m), \overline{fo}^1(cs'_m), \overline{len}^1(cs'_m) and \overline{coh}^1(cs'_m) are respectively the average tf×idf, fo, len and coh feature attribute values of cs'_m over the text document set T^1; the text document set T^1 is the set of text documents in the training text document set for which the candidate topic word sense cs'_m is a topic word sense, and the text document set T^0 is the set of text documents in the training text document set for which cs'_m is not a topic word sense;

the averages \overline{tf×idf}^1(cs'_m), \overline{fo}^1(cs'_m), \overline{len}^1(cs'_m) and \overline{coh}^1(cs'_m) are calculated respectively as:

$$\overline{tf \times idf}^{\,1}(cs'_m) = \frac{\sum_{n=1}^{|T^1|} tf \times idf_n^{\,1}(cs'_m)}{|T^1|}$$

$$\overline{fo}^{\,1}(cs'_m) = \frac{\sum_{n=1}^{|T^1|} fo_n^{\,1}(cs'_m)}{|T^1|}$$

$$\overline{len}^{\,1}(cs'_m) = \frac{\sum_{n=1}^{|T^1|} len_n^{\,1}(cs'_m)}{|T^1|}$$

$$\overline{coh}^{\,1}(cs'_m) = \frac{\sum_{n=1}^{|T^1|} coh_n^{\,1}(cs'_m)}{|T^1|}$$

wherein tf×idf_n^1(cs'_m), fo_n^1(cs'_m), len_n^1(cs'_m) and coh_n^1(cs'_m) are respectively the tf×idf, fo, len and coh feature attribute values of the candidate topic word sense cs'_m in the n-th text document of the text document set T^1.
Advantageous effects:
the invention provides an automatic extraction method of text document theme word senses, which uses word senses to replace words for processing, and solves the problems of inaccurate expression of theme meaning and misoperation in the extraction process caused by word ambiguity, thereby improving the accuracy of an algorithm. In addition, in the extraction process, the invention considers statistical information (Bayesian estimation probability) and semantic information (word meaning) at the same time, thereby further improving the accuracy of the algorithm.
Drawings
FIG. 1: basic flow diagram of the method of the invention
FIG. 2: experimental result chart for extracting subject word meaning by using method of the invention
Detailed Description
Given a training text document set T = {t_1, …, t_{|T|}} and a set of text documents to be extracted (the test text document set) E = {e_1, …, e_{|E|}}, each text document in T and E is processed according to the following steps one and two, specifically:

Step one: text document preprocessing. For a text document t_i in T (i = 1, …, |T|, where |T| is the number of text documents in the text document set T), first use step 1.1 to obtain the candidate subject words of the text document, then use step 1.2 to obtain the candidate topic word senses, and finally use step 1.3 to merge the candidate topic word senses, obtaining the final candidate topic word sense set of the text document t_i.

Step 1.1: acquiring candidate subject words. First, remove the numbers and punctuation marks in the text document t_i and represent the text document as a collection of words: t_i = {w_1, …, w_ij, …}; then, for each word w_ij in this set, apply the following rules to judge whether it can be a candidate subject word: if the number of letters composing w_ij is greater than a predetermined value L (here L = 15), or all the letters composing w_ij are capital letters, or w_ij is a stop word (i.e. a function word such as an article or pronoun), then w_ij cannot be a candidate subject word and is removed from the set {w_1, …, w_ij, …}; finally, convert the capital letters of the remaining words to lowercase and remove the suffixes of the words, i.e. represent each candidate subject word in root form, obtaining the candidate subject word set CW_i = {cw_1, …, cw_ij, …} of the text document t_i.
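For illustration, the preprocessing of step 1.1 can be sketched in a few lines of Python (a minimal sketch, not part of the patent; the tiny stop-word list and the crude suffix stripper merely stand in for a full function-word list and a real stemmer):

```python
import re

# Illustrative stop-word list; the patent assumes a full list of function words.
STOP_WORDS = {"a", "an", "the", "of", "in", "on", "and", "or", "it", "this", "that"}
MAX_LETTERS = 15  # the predetermined value L of step 1.1

def crude_stem(word):
    """Very rough stand-in for the patent's suffix removal (root form)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_candidate_words(text):
    """Step 1.1: strip digits/punctuation, filter words, lowercase, stem."""
    tokens = re.findall(r"[A-Za-z]+", text)  # drops numbers and punctuation
    candidates = []
    for w in tokens:
        if len(w) > MAX_LETTERS:         # longer than L letters
            continue
        if w.isupper():                  # composed entirely of capital letters
            continue
        if w.lower() in STOP_WORDS:      # stop word (function word)
            continue
        candidates.append(crude_stem(w.lower()))
    return candidates

print(extract_candidate_words("The MOUSE ran across the computing laboratory."))
# -> ['ran', 'acros', 'comput', 'laboratory']
```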
Step 1.2: acquiring candidate topic word senses. For each candidate subject word cw_ij (j = 1, …, |CW_i|, where |CW_i| is the number of candidate subject words in the candidate subject word set CW_i) in the candidate subject word set CW_i = {cw_1, …, cw_ij, …} of the text document t_i, the disambiguation algorithm of the invention obtains its correct word sense in the text document t_i.

First, select all the words within a distance W of cw_ij as its context, giving the context set C_ij = {c_ij1, …, c_ijl, …} of cw_ij (|C_ij| is the number of words in the context set C_ij), and let S_ij = {s_ij1, …, s_ijk, …} be the set of possible word senses of the candidate subject word cw_ij, where |S_ij| is the number of possible word senses in S_ij and a possible word sense is a word sense of the candidate subject word defined in the lexical database WordNet. Then, compute the semantic relatedness rel(s_ijk, c_ijl) between the k-th possible word sense s_ijk of the candidate subject word cw_ij and its l-th context word c_ijl as follows:

$$\mathrm{rel}(s_{ijk}, c_{ijl}) = \frac{\mathrm{NumOfOverlaps}_{s_{ijk} c_{ijl}}}{(\mathrm{wordNumInGlossOf}s_{ijk} + \mathrm{wordNumInGlossOf}c_{ijl})/2} \qquad (1)$$

wherein wordNumInGlossOfs_ijk denotes the number of words contained in the WordNet gloss of s_ijk, wordNumInGlossOfc_ijl denotes the number of words contained in the WordNet gloss of c_ijl, and NumOfOverlaps_s_ijk c_ijl denotes the number of identical words shared by the WordNet glosses of s_ijk and c_ijl.

Then, obtain the total semantic relatedness SenseScore(s_ijk) of each possible word sense s_ijk to all the context words c_ijl (l = 1, …, |C_ij|) in the context set according to the following formula:

$$\mathrm{SenseScore}(s_{ijk}) = \sum_{l=1}^{|C_{ij}|} \mathrm{rel}(s_{ijk}, c_{ijl}) \qquad (2)$$

Finally, select the possible word sense with the largest total semantic relatedness SenseScore value as the correct word sense of the candidate subject word cw_ij, i.e. the candidate topic word sense of cw_ij.

The above method is used to compute the candidate topic word senses of all the candidate subject words cw_ij (j = 1, 2, …, |CW_i|) in the candidate subject word set CW_i = {cw_1, …, cw_ij, …} of the text document t_i, which form the candidate topic word sense set of t_i, denoted CS_i = {cs_i1, …, cs_im, …}, where |CS_i| is the number of candidate topic word senses in CS_i.
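The gloss-overlap disambiguation of formulas (1) and (2) can be sketched as follows (illustrative only; glosses are supplied here as plain word lists rather than looked up in WordNet, and all names are assumptions of this sketch):

```python
def rel(gloss_a, gloss_b):
    """Formula (1): shared gloss words divided by average gloss length."""
    overlaps = len(set(gloss_a) & set(gloss_b))
    return overlaps / ((len(gloss_a) + len(gloss_b)) / 2)

def disambiguate(sense_glosses, context_glosses):
    """Formula (2): choose the sense whose summed relatedness to all
    context glosses (SenseScore) is largest."""
    def sense_score(gloss):
        return sum(rel(gloss, c) for c in context_glosses)
    return max(sense_glosses, key=lambda s: sense_score(sense_glosses[s]))

# Toy example: two senses of "mouse" and two context-word glosses.
senses = {
    "mouse#animal": ["small", "rodent", "with", "long", "tail"],
    "mouse#device": ["hand", "operated", "pointing", "device", "computer"],
}
contexts = [["machine", "computer", "screen"], ["pointing", "cursor", "device"]]
print(disambiguate(senses, contexts))  # -> mouse#device
```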
Step 1.3: merging candidate topic word senses. For any two candidate topic word senses cs_ip and cs_iq (p, q = 1, 2, …, |CS_i|, p ≠ q) in the candidate topic word sense set CS_i, compute the semantic relatedness rel(cs_ip, cs_iq) between them according to formula (3); if rel(cs_ip, cs_iq) > λ (λ is a given threshold), the corresponding candidate subject words cw_ip and cw_iq are considered semantically identical, and cs_ip and cs_iq are treated as the same candidate topic word sense, i.e. cs_ip or cs_iq is deleted from the candidate topic word sense set CS_i.

$$\mathrm{rel}(cs_{ip}, cs_{iq}) = \frac{\mathrm{NumOfOverlaps}_{cs_{ip} cs_{iq}}}{(\mathrm{wordNumInGlossOf}cs_{ip} + \mathrm{wordNumInGlossOf}cs_{iq})/2} \qquad (3)$$

wherein wordNumInGlossOfcs_ip denotes the number of words contained in the WordNet gloss of cs_ip, wordNumInGlossOfcs_iq denotes the number of words contained in the WordNet gloss of cs_iq, and NumOfOverlaps_cs_ip cs_iq denotes the number of identical words shared by the WordNet glosses of cs_ip and cs_iq.
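Step 1.3 then reduces to a pairwise filter; a minimal sketch, assuming glosses are again plain word lists and λ = 0.7 (all names illustrative):

```python
def rel(gloss_a, gloss_b):
    """Formula (3): shared gloss words divided by average gloss length."""
    return len(set(gloss_a) & set(gloss_b)) / ((len(gloss_a) + len(gloss_b)) / 2)

def merge_senses(sense_glosses, lam=0.7):
    """Step 1.3: keep a candidate sense only if its relatedness to every
    already-kept sense does not exceed the threshold lambda."""
    kept = []
    for sense, gloss in sense_glosses.items():
        if all(rel(gloss, sense_glosses[k]) <= lam for k in kept):
            kept.append(sense)
    return kept

senses = {
    "car#auto": ["motor", "vehicle", "four", "wheels"],
    "automobile#auto": ["motor", "vehicle", "wheels", "engine"],
    "bank#finance": ["financial", "institution", "deposits"],
}
print(merge_senses(senses))  # -> ['car#auto', 'bank#finance']
```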
Step two: computing feature attributes. For each candidate topic word sense cs_im (m = 1, 2, …, |CS_i|, where |CS_i| is the number of candidate topic word senses in CS_i) in the candidate topic word sense set CS_i of the text document t_i obtained in step one, compute its four feature attribute values, namely the frequency tf×idf with which the candidate topic word sense appears in the text document, the average position fo of its first appearance in the text document, the number of letters len it contains, and the cohesion coh between candidate topic word senses. The tf×idf, fo and coh attribute values of the candidate topic word sense cs_im are computed as follows:

$$tf \times idf(cs_{im}) = f(cs_{im}) \times \log \frac{|T|}{|T(cs_{im})|} \qquad (4)$$

$$fo(cs_{im}) = O_{first} / |CS_i| \qquad (5)$$

$$coh(cs_{im}) = \frac{\sum_{p=1,\, p \neq m}^{|CS_i|} \mathrm{rel}(cs_{im}, cs_{ip})}{|CS_i| - 1} \qquad (6)$$

wherein f(cs_im) is the number of occurrences of the candidate topic word sense cs_im in the text document t_i; |T| is the number of text documents in the text document set T; |T(cs_im)| is the number of text documents in T that contain the candidate topic word sense cs_im; O_first is the position of the first occurrence of cs_im in the text document t_i; and rel(cs_im, cs_ip) is the semantic relatedness between candidate topic word senses computed by formula (3).
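A compact sketch of the step-two feature computation (illustrative; `doc` is a document as an ordered list of candidate sense identifiers, `docs` the whole document set, and the relatedness function `rel` is assumed to implement the gloss overlap of formula (3)):

```python
import math

def features(doc, docs, rel):
    """Formulas (4)-(6): tf-idf, first-occurrence position and cohesion for
    every candidate topic sense of one document."""
    senses = list(dict.fromkeys(doc))            # unique senses, in order
    out = {}
    for cs in senses:
        tf = doc.count(cs)                       # f(cs): occurrences in doc
        df = sum(1 for d in docs if cs in d)     # |T(cs)|: docs containing cs
        tfidf = tf * math.log(len(docs) / df)    # formula (4)
        fo = doc.index(cs) / len(senses)         # formula (5): O_first / |CS|
        coh = (sum(rel(cs, o) for o in senses if o != cs)
               / max(len(senses) - 1, 1))        # formula (6)
        out[cs] = {"tfidf": tfidf, "fo": fo, "len": len(cs), "coh": coh}
    return out

doc = ["mouse#device", "computer#machine", "mouse#device"]
docs = [doc, ["keyboard#device"], ["computer#machine"]]
print(features(doc, docs, rel=lambda a, b: 0.2))  # constant stand-in relatedness
```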
Each text document in the text document set to be extracted E = {e_1, …, e_i, …, e_{|E|}} (i.e. the test text document set) is also processed by steps one and two above. For each text document e_i in E = {e_1, …, e_i, …, e_{|E|}}, this yields its candidate topic word sense set CE_i = {ce_i1, …, ce_ij, …} and the four feature attribute values of each candidate topic word sense ce_ij therein: tf×idf(ce_ij), fo(ce_ij), len(ce_ij) and coh(ce_ij). The Bayesian estimation method of step three is then used to extract the topic word senses of the text document set to be extracted E = {e_1, …, e_i, …, e_{|E|}} (i.e. the test text document set). Specifically:
Step three: extracting topic word senses. Since the topic word senses of the training text document set are known, for each candidate topic word sense ce_ij of a text document e_i in the test text document set E, the training text document set T is first divided into two classes according to whether ce_ij is a topic word sense of each training text document: for a text document t_i in the training text document set T, if the candidate topic word sense ce_ij is a topic word sense of t_i, then t_i falls into the first class of text documents, the set T^1; if ce_ij is not a topic word sense of t_i, then t_i falls into the second class of text documents, the set T^0. Then, the average attribute values of ce_ij over the set T^1 are computed by the following formulas:

$$\overline{tf \times idf}^{\,1}(ce_{ij}) = \frac{\sum_{u=1}^{|T^1|} tf \times idf_u^{\,1}(ce_{ij})}{|T^1|} \qquad (7)$$

$$\overline{fo}^{\,1}(ce_{ij}) = \frac{\sum_{u=1}^{|T^1|} fo_u^{\,1}(ce_{ij})}{|T^1|} \qquad (8)$$

$$\overline{len}^{\,1}(ce_{ij}) = \frac{\sum_{u=1}^{|T^1|} len_u^{\,1}(ce_{ij})}{|T^1|} \qquad (9)$$

$$\overline{coh}^{\,1}(ce_{ij}) = \frac{\sum_{u=1}^{|T^1|} coh_u^{\,1}(ce_{ij})}{|T^1|} \qquad (10)$$

wherein tf×idf_u^1(ce_ij), fo_u^1(ce_ij), len_u^1(ce_ij) and coh_u^1(ce_ij) are respectively the tf×idf, fo, len and coh attribute values of ce_ij in the u-th text document t_u of the set T^1;
Finally, the probability Pr that the candidate topic word sense ce_ij is a final topic word sense of the text document e_i is computed according to the following formula:

Pr = Pr[T|yes] × Pr[O|yes] × Pr[L|yes] × Pr[C|yes] × Pr[yes]   (11)

wherein Pr[T|yes], Pr[O|yes], Pr[L|yes] and Pr[C|yes] respectively denote the Bayesian estimation probabilities that the candidate topic word sense ce_ij of the text document e_i in the test text document set E is a topic word sense given its current feature attribute values tf×idf, fo, len and coh, and Pr[yes] denotes the ratio of the number of text documents in the training text document set for which the candidate topic word sense is a topic word sense to the number of text documents in the training text document set for which it is not;

the calculation formulas of Pr[T|yes], Pr[O|yes], Pr[L|yes], Pr[C|yes] and Pr[yes] are respectively:

$$\Pr[T|yes] = tf \times idf^{\,e_i}(ce_{ij}) \,/\, \overline{tf \times idf}^{\,1}(ce_{ij}) \qquad (12)$$

$$\Pr[O|yes] = fo^{\,e_i}(ce_{ij}) \,/\, \overline{fo}^{\,1}(ce_{ij}) \qquad (13)$$

$$\Pr[L|yes] = len^{\,e_i}(ce_{ij}) \,/\, \overline{len}^{\,1}(ce_{ij}) \qquad (14)$$

$$\Pr[C|yes] = coh^{\,e_i}(ce_{ij}) \,/\, \overline{coh}^{\,1}(ce_{ij}) \qquad (15)$$

$$\Pr[yes] = |T^1| / |T^0| \qquad (16)$$

wherein tf×idf^{e_i}(ce_ij), fo^{e_i}(ce_ij), len^{e_i}(ce_ij) and coh^{e_i}(ce_ij) are respectively the tf×idf, fo, len and coh attribute values of ce_ij in the text document e_i of the test text document set E; |T^1| and |T^0| are respectively the numbers of text documents contained in the sets T^1 and T^0.
The above method is used to compute, for each text document e_i in the text document set to be extracted (i.e. the test text document set), the probability Pr of every candidate topic word sense in its candidate topic word sense set becoming a final topic word sense; these are sorted by Pr value from large to small, and the top N candidate topic word senses are taken, as required, as the extracted topic word senses of the text document e_i.
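Step three is a naive-Bayes-style scoring and ranking; a minimal sketch, under the assumption that the per-document feature values of a sense over the T^1 training documents are already available (all names illustrative):

```python
def score_sense(test_feats, pos_feats, n_pos, n_neg):
    """Formula (11): Pr = Pr[T|yes]*Pr[O|yes]*Pr[L|yes]*Pr[C|yes]*Pr[yes].
    pos_feats[name] lists the values of this sense's feature `name` over the
    T^1 documents (training documents where the sense IS a topic sense)."""
    pr = n_pos / n_neg                            # formula (16): |T^1|/|T^0|
    for name in ("tfidf", "fo", "len", "coh"):
        avg = sum(pos_feats[name]) / len(pos_feats[name])  # formulas (7)-(10)
        pr *= test_feats[name] / avg              # formulas (12)-(15)
    return pr

def top_n(candidate_scores, n):
    """Sort candidate senses by Pr, descending, and keep the top N."""
    return sorted(candidate_scores, key=candidate_scores.get, reverse=True)[:n]

scores = {"s1": 0.8, "s2": 1.4, "s3": 0.3}
print(top_n(scores, 2))  # -> ['s2', 's1']
```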
Example experiment: the invention was implemented in Java, and a set of experiments was performed to evaluate it, with the threshold λ set to 0.9. The experimental data are 500 text documents containing subject words, randomly downloaded from an online text document database maintained by the UN Food and Agriculture Organization. These text documents contain an average of 4.95 subject words each. 300 text documents were used to train the model, and the other 200 were used for testing.
Precision, Recall and the combined F-measure are used to evaluate the topic word sense extraction algorithm:

$$\mathrm{Precision} = \frac{\mathrm{correct\_extracted\_keywords}}{\mathrm{all\_extracted\_keywords}} \qquad (17)$$

$$\mathrm{Recall} = \frac{\mathrm{correct\_extracted\_keywords}}{\mathrm{manually\_assigned\_keywords}} \qquad (18)$$

$$F\text{-}\mathrm{measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \qquad (19)$$

wherein correct_extracted_keywords is the number of correctly extracted topic word senses, all_extracted_keywords is the number of all extracted topic word senses, and manually_assigned_keywords is the number of manually assigned topic word senses.
Formulas (17), (18) and (19) are applied to each text document; the final Precision, Recall and F-measure are the averages over the entire test text document set.
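A small sketch of this per-document evaluation, assuming `extracted` and `assigned` are sets of topic sense identifiers (names illustrative):

```python
def evaluate(extracted, assigned):
    """Formulas (17)-(19): Precision, Recall and F-measure for one document."""
    correct = len(extracted & assigned)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(assigned) if assigned else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f

print(evaluate({"s1", "s2", "s3"}, {"s2", "s3", "s4", "s5"}))
# -> (0.666..., 0.5, 0.571...)
```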
FIG. 2 of the description shows the experimental results. The horizontal axis is the total number of topic word senses extracted by the method of the invention, ranging from 1 to 20; the vertical axis is the average number of correctly extracted topic word senses. As can be seen from the figure, when the total number of extracted topic word senses is 5, about 3 are correct, reaching an accuracy of about 60%; when it is 9, about 4 are correct, reaching an accuracy of about 80%; and when it is 15, about 4.5 are correct, reaching an accuracy of 90%. This analysis shows that the topic word sense extraction method of the invention performs well.
The top five ranked topic word senses are selected from the topic word sense set extracted for each text document; then the Precision, Recall and F-measure of each text document are calculated using evaluation formulas (17), (18) and (19); finally, the averages of these performance measures over all text documents are calculated, with the final results shown in Table 1.
TABLE 1 Performance of the topic word sense extraction algorithm

Topic word sense extraction algorithm    Precision    Recall    F-measure
Top 5 topic word senses                  0.595        0.612     0.603
As the evaluation experiments show, the topic meaning extraction method of the invention performs well, with high accuracy and recall, and can be applied to automatic topic meaning extraction from text documents. This is mainly because the invention processes word senses instead of words and can therefore acquire the subject meaning of a text document more accurately. As can be seen from FIG. 2, when the total number of topic word senses extracted by the algorithm reaches 9, the accuracy reaches 80%, so the method can also be applied to semi-automatic topic labeling of text documents: a number of topic word senses are generated by the method and then screened by the user.

Claims (2)

1. A method for automatically extracting the theme word meaning of a text document is characterized by comprising the following steps:
(1) respectively preprocessing each text document in the training text document set and the test text document set to obtain a candidate topic word meaning set of each text document;
the pretreatment comprises the following steps:
step a: extracting a candidate subject term set of the text document:
firstly, removing the numbers and punctuation marks in the text document, and segmenting the text document into a set of words;
then, removing words which do not meet the conditions in the set;
finally, converting capital letters in the remaining words into lowercase letters, and removing prefixes and suffixes of the words to obtain a candidate subject word set of the text document;
the conditions are as follows: the number of letters forming the word is less than a preset value, the word contains at least one lowercase letter, and the word is a non-stop word; non-stop words are all words other than stop words, and stop words are function words; the preset value is 15 letters;
step b: adopting a disambiguation algorithm to obtain a candidate subject word sense set of the text document:
firstly, selecting the words within a distance W of each candidate subject word in the candidate subject word set as the context of that candidate subject word; the value range of W is [6, 10];
then, according to the semantic relatedness formula

$$\mathrm{rel}(s_k, c_i) = \frac{\mathrm{NumOfOverlaps}_{s_k c_i}}{(\mathrm{wordNumInGlossOf}s_k + \mathrm{wordNumInGlossOf}c_i)/2}$$

computing the semantic relatedness rel(s_k, c_i) between the k-th possible word sense s_k of each candidate subject word and the i-th context word c_i of that candidate subject word, and according to

$$\mathrm{SenseScore}(s_k) = \sum_{i=1}^{I} \mathrm{rel}(s_k, c_i)$$

computing the total semantic relatedness SenseScore(s_k) of the k-th possible word sense s_k to all contexts of the candidate subject word;

wherein k = 1, 2, …, K, and K is the number of possible word senses of the candidate subject word; i = 1, 2, …, I, and I is the number of context words of the candidate subject word; wordNumInGlossOfs_k denotes the number of words contained in the WordNet gloss of s_k, wordNumInGlossOfc_i denotes the number of words contained in the WordNet gloss of c_i, and NumOfOverlaps_s_k c_i denotes the number of identical words shared by the WordNet glosses of s_k and c_i; the possible word senses are those defined in the lexical database WordNet;
finally, selecting the possible word sense with the maximum total semantic relevance SenseScore value as a candidate subject word sense of the candidate subject word to obtain a candidate subject word sense set of the text document;
step c: merging candidate topic word senses:

according to the semantic relatedness formula

$$\mathrm{rel}(cs_p, cs_q) = \frac{\mathrm{NumOfOverlaps}_{cs_p cs_q}}{(\mathrm{wordNumInGlossOf}cs_p + \mathrm{wordNumInGlossOf}cs_q)/2}$$

computing the semantic relatedness of any two candidate topic word senses cs_p and cs_q in the candidate topic word sense set, and removing either one of any two candidate topic word senses whose semantic relatedness value is larger than a given threshold λ; the value range of the threshold λ is [0.5, 0.8];

wherein p, q = 1, 2, …, J, p ≠ q, and J is the number of candidate topic word senses in the candidate topic word sense set; wordNumInGlossOfcs_p denotes the number of words contained in the WordNet gloss of cs_p, wordNumInGlossOfcs_q denotes the number of words contained in the WordNet gloss of cs_q, and NumOfOverlaps_cs_p cs_q denotes the number of identical words shared by the WordNet glosses of cs_p and cs_q;
(2) calculating the feature attribute values of each candidate topic word sense in the candidate topic word sense set in the text document; the feature attributes comprise: the frequency tf×idf with which the candidate topic word sense occurs in the text document, the average position fo of the first occurrence of the candidate topic word sense in the text document, the number of letters len contained in the candidate topic word sense, and the cohesion coh between candidate topic word senses;
the calculation formula of the frequency tf multiplied by idf of the candidate topic word senses appearing in the text document is as follows:
$$tf \times idf(cs_j) = f(cs_j) \times \log \frac{|D|}{|D(cs_j)|}$$

wherein cs_j is the j-th candidate topic word sense in the candidate topic word sense set of the text document, j = 1, 2, …, J, and J is the number of candidate topic word senses in that set; f(cs_j) is the number of occurrences of cs_j in the text document; D denotes the text document set; |D| is the number of text documents in D; and |D(cs_j)| is the number of text documents in D that contain the candidate topic word sense cs_j;
the calculation formula of the average position fo of the candidate topic word senses appearing for the first time in the text document is as follows:
$$fo(cs_j) = O_{first} / J$$

wherein O_{first} is the position of the first occurrence of the candidate topic word sense cs_j in the text document;
the calculation formula of the cohesiveness coh among the candidate topic word senses is as follows:
$$coh(cs_j) = \frac{\sum_{l=1,\, l \neq j}^{J} \mathrm{rel}(cs_j, cs_l)}{J - 1}$$

wherein rel(cs_j, cs_l) is the semantic relatedness between the candidate topic word senses cs_j and cs_l in the candidate topic word sense set of the text document, computed according to the semantic relatedness formula

$$\mathrm{rel}(cs_j, cs_l) = \frac{\mathrm{NumOfOverlaps}_{cs_j cs_l}}{(\mathrm{wordNumInGlossOf}cs_j + \mathrm{wordNumInGlossOf}cs_l)/2}$$

where wordNumInGlossOfcs_j denotes the number of words contained in the WordNet gloss of cs_j, wordNumInGlossOfcs_l denotes the number of words contained in the WordNet gloss of cs_l, and NumOfOverlaps_cs_j cs_l denotes the number of identical words shared by the WordNet glosses of cs_j and cs_l;
(3) extracting a final theme word meaning set of each text document in the test text document set by using a Bayesian model:
firstly, calculating the probability Pr that each candidate topic word sense of each text document in the test text document set is a topic word sense, according to Pr = Pr[T|yes] × Pr[O|yes] × Pr[L|yes] × Pr[C|yes] × Pr[yes];

wherein Pr[T|yes], Pr[O|yes], Pr[L|yes] and Pr[C|yes] respectively denote the probability that a candidate topic word sense is a topic word sense given its tf×idf, fo, len and coh feature attribute values, and Pr[yes] denotes the ratio of the number of text documents in the training text document set for which the candidate topic word sense is a topic word sense to the number of text documents in the training text document set for which it is not;
then, sorting all candidate topic word senses of the text document from large to small according to their probability Pr values;
finally, selecting the top-ranked, user-set number of candidate topic word senses to form the final topic word sense set of the text document.
2. The method for automatically extracting the theme word meaning of a text document according to claim 1, characterized in that the calculation formulas of Pr[T|yes], Pr[O|yes], Pr[L|yes], Pr[C|yes] and Pr[yes] are respectively:
$$\Pr[T|yes] = \overline{tf \times idf}^{\,1}(cs'_m) \,/\, tf \times idf^{\,d'}(cs'_m)$$

$$\Pr[O|yes] = \overline{fo}^{\,1}(cs'_m) \,/\, fo^{\,d'}(cs'_m)$$

$$\Pr[L|yes] = \overline{len}^{\,1}(cs'_m) \,/\, len^{\,d'}(cs'_m)$$

$$\Pr[C|yes] = \overline{coh}^{\,1}(cs'_m) \,/\, coh^{\,d'}(cs'_m)$$

$$\Pr[yes] = \frac{|T^1|}{|T^0|}$$

wherein d' is a text document in the test text document set; cs'_m is the m-th candidate topic word sense of the text document d', m = 1, 2, …, M, and M is the number of candidate topic word senses in the candidate topic word sense set of d'; tf×idf^{d'}(cs'_m), fo^{d'}(cs'_m), len^{d'}(cs'_m) and coh^{d'}(cs'_m) are respectively the tf×idf, fo, len and coh feature attribute values of the candidate topic word sense cs'_m in the text document d'; \overline{tf×idf}^1(cs'_m), \overline{fo}^1(cs'_m), \overline{len}^1(cs'_m) and \overline{coh}^1(cs'_m) are respectively the average tf×idf, fo, len and coh feature attribute values of cs'_m over the text document set T^1; the text document set T^1 is the set of text documents in the training text document set for which the candidate topic word sense cs'_m is a topic word sense, and the text document set T^0 is the set of text documents in the training text document set for which cs'_m is not a topic word sense;

the averages \overline{tf×idf}^1(cs'_m), \overline{fo}^1(cs'_m), \overline{len}^1(cs'_m) and \overline{coh}^1(cs'_m) are calculated respectively as:

$$\overline{tf \times idf}^{\,1}(cs'_m) = \frac{\sum_{n=1}^{|T^1|} tf \times idf_n^{\,1}(cs'_m)}{|T^1|}$$

$$\overline{fo}^{\,1}(cs'_m) = \frac{\sum_{n=1}^{|T^1|} fo_n^{\,1}(cs'_m)}{|T^1|}$$

$$\overline{len}^{\,1}(cs'_m) = \frac{\sum_{n=1}^{|T^1|} len_n^{\,1}(cs'_m)}{|T^1|}$$

$$\overline{coh}^{\,1}(cs'_m) = \frac{\sum_{n=1}^{|T^1|} coh_n^{\,1}(cs'_m)}{|T^1|}$$

wherein tf×idf_n^1(cs'_m), fo_n^1(cs'_m), len_n^1(cs'_m) and coh_n^1(cs'_m) are respectively the tf×idf, fo, len and coh feature attribute values of the candidate topic word sense cs'_m in the n-th text document of the text document set T^1.
CN2010102101066A 2010-06-24 2010-06-24 Automatic extraction method for text document theme word meaning Active CN101887415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102101066A CN101887415B (en) 2010-06-24 2010-06-24 Automatic extraction method for text document theme word meaning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102101066A CN101887415B (en) 2010-06-24 2010-06-24 Automatic extraction method for text document theme word meaning

Publications (2)

Publication Number Publication Date
CN101887415A true CN101887415A (en) 2010-11-17
CN101887415B CN101887415B (en) 2012-05-23

Family

ID=43073341

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102101066A Active CN101887415B (en) 2010-06-24 2010-06-24 Automatic extraction method for text document theme word meaning

Country Status (1)

Country Link
CN (1) CN101887415B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455487A (en) * 2012-05-29 2013-12-18 腾讯科技(深圳)有限公司 Extracting method and device for search term
CN106484920A (en) * 2016-11-21 2017-03-08 北京恒华伟业科技股份有限公司 A kind of abstracting method of evaluation document index
CN107729480A (en) * 2017-10-16 2018-02-23 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device of limited area
CN108512873A (en) * 2017-02-27 2018-09-07 中国科学院沈阳自动化研究所 A kind of grouping semantic messages filtering of distributed ad-hoc structure and method for routing
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of theme phrase extraction method
CN110020153A (en) * 2017-11-30 2019-07-16 北京搜狗科技发展有限公司 A kind of searching method and device
CN110209941B (en) * 2019-06-03 2021-01-15 北京卡路里信息技术有限公司 Method for maintaining push content pool, push method, device, medium and server
CN112307251A (en) * 2019-06-24 2021-02-02 上海松鼠课堂人工智能科技有限公司 Self-adaptive recognition correlation system and method for knowledge point atlas of English vocabulary

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Computer Engineering and Applications, No. 1, 31 January 2005, Wang Meng et al., "Chinese automatic summarization system based on a concept vector space model", pp. 107-110 *
Computer Science, Vol. 35, No. 6, 30 June 2008, Fang Jun et al., "Semantics-based keyword extraction algorithm", pp. 148-151 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455487A (en) * 2012-05-29 2013-12-18 腾讯科技(深圳)有限公司 Extracting method and device for search term
CN103455487B (en) * 2012-05-29 2018-07-06 腾讯科技(深圳)有限公司 The extracting method and device of a kind of search term
CN106484920A (en) * 2016-11-21 2017-03-08 北京恒华伟业科技股份有限公司 A kind of abstracting method of evaluation document index
CN108512873A (en) * 2017-02-27 2018-09-07 中国科学院沈阳自动化研究所 A kind of grouping semantic messages filtering of distributed ad-hoc structure and method for routing
CN108512873B (en) * 2017-02-27 2020-02-04 中国科学院沈阳自动化研究所 Packet semantic message filtering and routing method of distributed self-organizing structure
CN107729480A (en) * 2017-10-16 2018-02-23 北京神州泰岳软件股份有限公司 A kind of Text Information Extraction method and device of limited area
CN107729480B (en) * 2017-10-16 2020-06-26 中科鼎富(北京)科技发展有限公司 Text information extraction method and device for limited area
CN110020153A (en) * 2017-11-30 2019-07-16 北京搜狗科技发展有限公司 A kind of searching method and device
CN108920454A (en) * 2018-06-13 2018-11-30 北京信息科技大学 A kind of theme phrase extraction method
CN110209941B (en) * 2019-06-03 2021-01-15 北京卡路里信息技术有限公司 Method for maintaining push content pool, push method, device, medium and server
CN112307251A (en) * 2019-06-24 2021-02-02 上海松鼠课堂人工智能科技有限公司 Self-adaptive recognition correlation system and method for knowledge point atlas of English vocabulary
CN112307251B (en) * 2019-06-24 2021-08-20 上海松鼠课堂人工智能科技有限公司 Self-adaptive recognition correlation system and method for knowledge point atlas of English vocabulary

Also Published As

Publication number Publication date
CN101887415B (en) 2012-05-23

Similar Documents

Publication Publication Date Title
CN101887415B (en) Automatic extraction method for text document theme word meaning
CN107220295B (en) Searching and mediating strategy recommendation method for human-human contradiction mediating case
CN106649260B (en) Product characteristic structure tree construction method based on comment text mining
CN104063387B (en) Apparatus and method of extracting keywords in the text
Alzahrani et al. Fuzzy semantic-based string similarity for extrinsic plagiarism detection
CN103399901B (en) A kind of keyword abstraction method
CN109960756B (en) News event information induction method
Glenisson et al. Combining full-text analysis and bibliometric indicators. A pilot study
CN106651696B (en) Approximate question pushing method and system
CN102081601B (en) Field word identification method and device
CN103064969A (en) Method for automatically creating keyword index table
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN103823896A (en) Subject characteristic value algorithm and subject characteristic value algorithm-based project evaluation expert recommendation algorithm
CN103970730A (en) Method for extracting multiple subject terms from single Chinese text
Sabuna et al. Summarizing Indonesian text automatically by using sentence scoring and decision tree
CN102567308A (en) Information processing feature extracting method
Fujii Modeling anchor text and classifying queries to enhance web document retrieval
US20110055228A1 (en) Cooccurrence dictionary creating system, scoring system, cooccurrence dictionary creating method, scoring method, and program thereof
CN108363694B (en) Keyword extraction method and device
CN106776672A (en) Technology development grain figure determines method
CN103049470A (en) Opinion retrieval method based on emotional relevancy
CN104899188A (en) Problem similarity calculation method based on subjects and focuses of problems
Ljubešić et al. Language-independent gender prediction on twitter
Kiyomarsi et al. Optimizing persian text summarization based on fuzzy logic approach
Alliheedi et al. Rhetorical figuration as a metric in text summarization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C53 Correction of patent for invention or patent application
CB03 Change of inventor or designer information

Inventor after: Fang Jun

Inventor after: Guo Lei

Inventor after: Chang Weiwei

Inventor after: Yang Ning

Inventor before: Fang Jun

Inventor before: Guo Lei

Inventor before: Chang Weiwei

COR Change of bibliographic data

Free format text: CORRECT: INVENTOR; FROM: FANG JUN GUO LEI CHANG WEIWEI TO: FANG JUN GUO LEI CHANG WEIWEI YANG NING

ASS Succession or assignment of patent right

Owner name: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

Owner name: COMTEC SOLAR (JIANGSU) CO., LTD.

Free format text: FORMER OWNER: NORTHWESTERN POLYTECHNICAL UNIVERSITY

Effective date: 20140814

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 710072 XI AN, SHAANXI PROVINCE TO: 226600 NANTONG, JIANGSU PROVINCE

TR01 Transfer of patent right

Effective date of registration: 20140814

Address after: 226600 the Yellow Sea Road, Haian Development Zone, Haian County, Nantong, Jiangsu

Patentee after: Comtec Solar (Jiangsu) Co., Ltd.

Patentee after: Northwestern Polytechnical University

Address before: 710072 Xi'an friendship West Road, Shaanxi, No. 127

Patentee before: Northwestern Polytechnical University