CN101620611A - Method of generating conceptual titles - Google Patents


Info

Publication number
CN101620611A
Authority
CN
China
Prior art keywords
vocabulary
conceptual
titles
file
notional word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200810127624A
Other languages
Chinese (zh)
Other versions
CN101620611B (en)
Inventor
曾元显
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WEBGENIE INFORMATION Ltd
Original Assignee
WEBGENIE INFORMATION Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WEBGENIE INFORMATION Ltd
Priority to CN2008101276244A
Publication of CN101620611A
Application granted
Publication of CN101620611B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method of generating conceptual titles, comprising the following steps: first, extracting a plurality of feature words from a document cluster; second, searching a hierarchical knowledge structure for a plurality of concept words related to the feature words; third, automatically generating a conceptual title that covers the contents of the documents according to the hierarchical depth of each concept word and the number of times it is selected. Compared with the traditional approach of manually analyzing documents and naming titles, the workload is reduced.

Description

Method of generating conceptual titles
Technical field
The invention relates to a title generation method, and in particular to a method of generating conceptual titles for document clusters.
Background art
In this era of information explosion, finding needed information quickly and effectively among a huge number of documents has become an important discipline. Research on document categorization and document clustering therefore plays an indispensable role in improving the efficiency of document retrieval, analysis, and management. In document categorization, a set of labels or terms is predefined for each category. Document clustering differs: after the documents have been grouped into clusters, each cluster must be given a concise descriptive title to help the analyst understand the clustering result. At present, cluster titles are mostly terms extracted from the documents themselves. Although this practice is reasonable, such terms are usually not enough to describe in general terms the content of all documents in a cluster. In particular, when the documents cover a fairly broad range of domain knowledge, a conceptual title (generic topic) is needed to ease the burden of document analysis.
In applications of document clustering, a title inevitably has to be assigned to each document cluster. Current methods for naming a cluster usually rely on extracting the "important terms" from its documents, and what counts as an important term often differs considerably from one clustering algorithm to another.
When documents are represented with the vector space model, a document cluster is represented by the weighted sum of its document vectors or by their centroid, and the term with the highest weight in these vectors is used as the cluster title. For example, the clustering methods proposed by Cutting et al. and by Marit A et al. use the normalized term frequency (TF) as the weight of each term in the document vector, whereas the method proposed by Yiming Yang et al. uses the product of TF and the inverse document frequency (IDF) as the weight.
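As a rough illustration of the vector-space labeling described above, the following sketch sums per-document term weights over a cluster and returns the top-weighted term. The function names, the tokenized-list input format, and the particular IDF form (log of N over document frequency) are assumptions made for illustration; they are not the exact formulations of the cited authors.

```python
import math
from collections import Counter

def label_by_tf(cluster_docs):
    """Return the term with the highest summed normalized TF in the cluster.

    cluster_docs: list of non-empty token lists (hypothetical input format).
    """
    totals = Counter()
    for doc in cluster_docs:
        for term, n in Counter(doc).items():
            totals[term] += n / len(doc)  # TF normalized by document length
    # sorted() makes tie-breaking deterministic (alphabetical)
    return max(sorted(totals), key=lambda t: totals[t])

def label_by_tfidf(cluster_docs, all_docs):
    """Return the term with the highest summed TF x IDF weight in the cluster.

    IDF is taken as log(N / df), one common variant (an assumption here).
    """
    n_docs = len(all_docs)
    df = Counter(term for doc in all_docs for term in set(doc))
    totals = Counter()
    for doc in cluster_docs:
        for term, n in Counter(doc).items():
            totals[term] += (n / len(doc)) * math.log(n_docs / df[term])
    return max(sorted(totals), key=lambda t: totals[t])
```

With plain TF, a word that is frequent everywhere can dominate the label; the IDF factor pushes its weight toward zero, which is why the two weightings can name the same cluster differently.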
In the self-organizing map method proposed by Krista Lagus et al., the document clusters are laid out on a two-dimensional map, and the term with the highest goodness measure is used as the title of each cluster.
In the application proposed by Russell Swan et al., which classifies terms in order to detect events, the cluster title is formed by joining the highest-ranked named entity with the highest-ranked noun phrase. The ranking is obtained by sorting the chi-square values of the terms that occur within a time interval.
In the web document clustering method proposed by Oren Zamir et al., the longest phrase that occurs in most of the documents of a cluster is used as the cluster title.
In a related field such as document summarization and translation, one main task of the Document Understanding Conference (DUC) is to produce very short summaries. Such summaries, about 10 words long, could serve as cluster titles. However, most approaches at this conference extract terms from the documents, and they still need a set of documents with manually assigned titles to train a "translation model" that maps document words to the manually assigned titles. Moreover, the summaries produced for a document set tend to be event-oriented descriptions rather than topic-oriented descriptions.
Most of the above methods choose terms from the document content itself as the cluster title. When the domain knowledge covered by a cluster is broad, however, titles chosen this way are rarely conceptual and cannot adequately summarize what the documents are about. Current solutions therefore still rely on professionals to name the clusters manually, which not only incurs a large labor cost but also reduces the efficiency of document classification.
Summary of the invention
In view of this, an object of the present invention is to provide a method of generating conceptual titles that extracts feature words from a plurality of documents, searches a hierarchical knowledge structure for the concept words corresponding to each feature word, and selects, according to the weights of these concept words, the conceptual title best suited to summarize the content of the documents.
To achieve the above object, the present invention proposes a method of generating conceptual titles that produces a conceptual title summarizing the content of a plurality of documents. The method comprises the following steps. First, a plurality of feature words are extracted from the documents. Next, a plurality of concept words corresponding to the feature words are looked up in a hierarchical knowledge structure, and the hierarchical depth and the selection count of each concept word are calculated, where the selection count is the total number of times the concept word is selected while concept words are being looked up for the feature words. Finally, the weight of each concept word is calculated from its hierarchical depth and selection count, and the concept word with the highest weight is chosen as the conceptual title of the documents.
According to the method for generating conceptual titles under the preferred embodiment of the present invention, wherein according to the vocabulary selection rule, a plurality of feature vocabulary of taking passages in these files comprise the following steps: to handle these files according to disconnected speech and keyword acquisition strategy earlier, to obtain a plurality of candidate's vocabulary.Then, calculate a plurality of related coefficients that candidate's vocabulary is associated with a plurality of item names in these files.And choose related coefficient greater than the feature vocabulary of certain particular value or related coefficient ordering several candidate's vocabulary up front as these files.In addition, can also calculate the frequency of occurrences of each candidate's vocabulary in these files, related coefficient and its frequency of occurrences of each candidate's vocabulary are multiplied each other, choose product greater than the feature vocabulary of certain particular value or product ordering several candidate's vocabulary up front as these files.
In the method according to a preferred embodiment of the present invention, the hierarchical knowledge structure is a tree structure comprising a root node and a plurality of child nodes, each of which represents a set of synonymous words. The links between the root node and the child nodes are established by a semantic relation such as the hypernym relation or the broader-term relation.
In the method according to a preferred embodiment of the present invention, the step of looking up the concept words in the hierarchical knowledge structure and calculating their hierarchical depth and selection count comprises: first finding the synonym sets corresponding to the feature words and the child nodes that represent them, and then selecting all synonym sets on the paths from those child nodes to the root node as the superordinate concept words of the feature words. The number of times a synonym set on these paths is selected is the selection count of the corresponding concept word. For each feature word, when the root node can be reached along different paths, a superordinate concept word shared by those paths is counted only once.
In the method according to a preferred embodiment of the present invention, the hierarchical depth is measured from the root node of the hierarchical knowledge structure and is incremented by one for each level down to the child node of the corresponding concept word.
In the method according to a preferred embodiment of the present invention, the weight is proportional both to the value of an S-shaped (sigmoid) function of the hierarchical depth and to the selection count.
With the aid of a hierarchical knowledge structure, the present invention maps the feature words of a plurality of documents to concept words in that structure and calculates the weight of each concept word from its hierarchical depth and selection count, thereby generating a conceptual title automatically.
To make the above and other objects, features, and advantages of the present invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Description of drawings
Fig. 1 is a flowchart of the method of generating conceptual titles according to a preferred embodiment of the present invention;
Fig. 2 is an example of conceptual titles extracted from document clusters according to a preferred embodiment of the present invention;
Fig. 3 is an example of conceptual titles extracted from document clusters according to a preferred embodiment of the present invention;
Fig. 4 is a topic map of the patent documents clustered into 6 document clusters according to a preferred embodiment of the present invention;
Fig. 5 shows the details of the 6 document clusters into which the patent documents are clustered according to a preferred embodiment of the present invention;
Fig. 6 is an example of conceptual titles extracted from document clusters according to a preferred embodiment of the present invention.
[Description of main reference numerals]
110~140: steps of the method of generating conceptual titles according to a preferred embodiment of the present invention
131, 240, 495, 650, 412, 90, 168, 883, 631, 603, 727, 226, 899, 853, 219, 388, 355, 492, 12, 712, 273: document topic identification codes
1, 2, 3, 4, 5, 6: document clusters
Embodiments
To make the content of the present invention clearer, embodiments are given below as examples by which the invention can actually be carried out.
The method of generating conceptual titles of the present invention extracts the representative feature words from the documents, finds all concept words corresponding to these feature words with the aid of a hierarchical knowledge structure, and finally determines from them a conceptual title that adequately describes the content of all the documents. Each step of the method is described in detail in the embodiments below.
Fig. 1 is a flowchart of the method of generating conceptual titles according to a preferred embodiment of the present invention. Referring to Fig. 1, a plurality of feature words are first extracted from a plurality of documents (step 110). For example, all documents are first processed by word segmentation and keyword extraction to obtain many candidate words, and suitable words are then selected from these candidate words as feature words.
To keep the screening of feature words from being affected by the choice of clustering method, this embodiment decides whether to select a word by calculating the correlation coefficient (CC) between each word T and each category C. The correlation coefficient is calculated as follows:
$$\mathrm{CC}(T, C) = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN)(FP + TN)(TP + FP)(TN + FN)}}$$
Here TP (true positive), FP (false positive), FN (false negative), and TN (true negative) denote, respectively, the number of articles that belong to category C and contain word T, the number that do not belong to category C but contain word T, the number that belong to category C but do not contain word T, and the number that neither belong to category C nor contain word T.
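The correlation coefficient can be sketched in a few lines. The sketch below assumes the standard form of this coefficient, with a square root over the four-factor denominator (the flattened formula in the source does not show it explicitly), and it returns 0.0 when the denominator is zero, which is a choice made here rather than something the description specifies. The function names and the input format are hypothetical.

```python
import math

def contingency(docs, labels, term, category):
    """Count TP, FP, FN, TN for one term and one category.

    docs: list of word collections; labels: category of each document.
    """
    tp = fp = fn = tn = 0
    for words, label in zip(docs, labels):
        has_term = term in words
        in_category = label == category
        if has_term and in_category:
            tp += 1
        elif has_term:
            fp += 1
        elif in_category:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def correlation_coefficient(tp, fp, fn, tn):
    """CC(T, C) with a square-root denominator (assumed standard form)."""
    numerator = tp * tn - fn * fp
    denominator = math.sqrt((tp + fn) * (fp + tn) * (tp + fp) * (tn + fn))
    # Assumption: treat an undefined coefficient as "no correlation".
    return numerator / denominator if denominator else 0.0
```

With this form the coefficient lies between -1 and 1, so a cut-off such as 0.7 is meaningful regardless of collection size.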
After the correlation coefficients between all candidate words and the categories have been calculated, each candidate word is accepted or rejected as a feature word according to the value or rank of its correlation coefficient, for example by checking whether the coefficient exceeds a given threshold (such as 0.7) or ranks near the top.
Preferably, this embodiment also calculates the frequency of occurrence of each candidate word in the documents (Term Frequency in the Cluster, TFC), multiplies each candidate word's correlation coefficient CC by its TFC to obtain the product CC × TFC, and chooses as feature words the candidate words whose product exceeds a given threshold or ranks near the top.
In addition to the correlation-based approaches above, this embodiment can also count, within a document cluster, the number of documents in which a candidate word occurs; when this number exceeds a given threshold (for example, 50 percent of all the documents), the candidate word can be selected as a feature word.
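The document-count criterion can be sketched as follows. The function name, the token-list input, and the strict "greater than" comparison are assumptions for illustration; the 0.5 default mirrors the 50 percent example in the text.

```python
from collections import Counter

def frequent_in_cluster(cluster_docs, min_fraction=0.5):
    """Return words that occur in more than min_fraction of the cluster's documents."""
    n_docs = len(cluster_docs)
    # set(doc) ensures each document contributes at most once per word
    doc_freq = Counter(word for doc in cluster_docs for word in set(doc))
    return sorted(w for w, k in doc_freq.items() if k / n_docs > min_fraction)
```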
To summarize the screening methods above, the correlation coefficient CC is a suitable criterion for extracting feature words from a large number of short articles. When the articles are long, the product CC × TFC of the correlation coefficient and the frequency of occurrence is the more suitable criterion.
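A minimal sketch of the two ranking criteria follows. It assumes the CC and TFC values have already been computed for each candidate word; the function name, the dictionary layout, and the sample numbers used with it are hypothetical. The default of five words mirrors the top-5 selection used later in the patent-document embodiment.

```python
def select_feature_words(candidates, top_k=5, use_tfc=True):
    """Rank candidate words by CC x TFC (long articles) or by CC alone (short ones).

    candidates: dict mapping word -> (cc, tfc).
    """
    def score(word):
        cc, tfc = candidates[word]
        return cc * tfc if use_tfc else cc

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(candidates, key=lambda w: (-score(w), w))
    return ranked[:top_k]
```

A threshold-based variant would filter on `score(word) > threshold` instead of slicing the ranked list; both options appear in the description.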
Once the feature words of the documents have been determined, all concept words corresponding to each feature word are looked up in a hierarchical knowledge structure, and the hierarchical depth and selection count of each concept word are calculated (step 120). The selection count is the total number of times a concept word is selected while concept words are being looked up for the feature words.
The hierarchical knowledge structure can be regarded as a tree consisting of a root node and many child nodes. The root node and every child node each represent a set of synonymous words, and the links between the root node and the child nodes, and among the child nodes, are established through a semantic relation. Take the English thesaurus WordNet as an example: its knowledge structure contains many synonym sets (of English words), and relations among synonym sets can be established through semantic relations such as the hypernym, hyponym, holonym (whole), and meronym (part) relations. The present invention only needs the hypernym relation or the broader-term relation. The synonym sets are further grouped into noun, verb, adjective, and adverb sets. In this hierarchical knowledge structure, the closer a child node is to the root node, the more superordinate its semantic relation and the broader and more general the meaning of its synonym set.
In the hierarchical knowledge structure, a word usually has more than one concept word (hypernym), and a word may itself be the concept word of several other words. For example, the concept words of dog include animal and creature, and gesture is the concept word of nod, shrug, and hug. In this embodiment, every node encountered on the path from the child node representing a feature word's synonym set up to the root node is selected as a concept word of that feature word.
Note that in the hierarchical knowledge structure there may be more than one path from a child node to the root node. When the selection count of a concept word is calculated, therefore, no matter how many times these paths pass through the node representing that concept word, it is counted only once for a given feature word. Besides the selection count, the hierarchical depth of each concept word must also be calculated. This embodiment takes the hierarchical depth of the root node to be 0, that of the nodes one level below it to be 1, and so on. The hierarchical depth of the concept word represented by a node is thus obtained by counting the number of levels between that node and the root node.
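The counting rules above can be sketched on a small hand-made hierarchy. Everything in the sketch is illustrative: the `PARENTS` mapping is a hypothetical toy structure, not actual WordNet data; the feature word's own node is excluded from its concept words; and when a node has several paths to the root, the shortest one is used for its depth. The description does not settle the last two points, so they are assumptions.

```python
from collections import Counter

# Hypothetical toy hierarchy: node -> list of parent (hypernym) nodes.
# "table" has two parents, so it reaches the root along two paths.
PARENTS = {
    "entity": [],
    "object": ["entity"],
    "artifact": ["object"],
    "furniture": ["artifact"],
    "surface": ["artifact"],
    "table": ["furniture", "surface"],
    "chair": ["furniture"],
    "bed": ["furniture"],
}

def ancestors(node, parents):
    """All nodes on any path from node up to the root, excluding node itself."""
    seen = set()
    stack = list(parents.get(node, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def depth(node, parents):
    """Levels below the root (root = 0); shortest path if several exist (assumption)."""
    node_parents = parents.get(node, [])
    if not node_parents:
        return 0
    return 1 + min(depth(p, parents) for p in node_parents)

def selection_counts(feature_words, parents):
    """Count, per concept word, how many feature words select it (once per word)."""
    counts = Counter()
    for word in feature_words:
        # ancestors() returns a set, so a concept shared by several
        # paths of the same feature word is counted only once.
        counts.update(ancestors(word, parents))
    return counts
```

In this toy structure, "artifact" lies on both of table's paths to the root yet contributes a single count for table, which is the once-per-feature-word rule stated in the text.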
In the hierarchical knowledge structure, the higher a word sits (that is, the smaller its hierarchical depth), the broader its meaning. If a very general concept word is selected as the conceptual title, it cannot express the content of the documents clearly; if a very narrow one is selected, it cannot summarize the content of all the documents. The best approach is therefore to select, among the concept words that cover all the feature words, the one with the greatest hierarchical depth as the conceptual title of the documents.
Following this principle, the next step calculates the weight of each concept word from its hierarchical depth in the hierarchical knowledge structure and its selection count (step 130). This embodiment assumes that the weight of a concept word is proportional to the value of an S-shaped (sigmoid) function of its hierarchical depth and to its selection count, and calculates the weight of each concept word with the following formula:
$$\mathrm{weight} = \frac{f}{nt} \times 2 \times \left( \frac{1}{1 + \exp(-c \times d)} - 0.5 \right)$$
Here f is the selection count of the concept word, d is the hierarchical depth at which the concept word sits, nt is the total number of feature words, and c is a constant (for example, 0.125). This weight formula is not intended to limit the present invention; those skilled in the art may, as actual needs dictate, adopt another suitable formula that calculates the weight from the selection count and hierarchical depth of each concept word.
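The weight formula translates directly into code. In the sketch below the function name is hypothetical, and the formula is read as (f / nt) × 2 × (1 / (1 + exp(−c × d)) − 0.5). As a consistency check on that reading, f = nt = 3, c = 0.125, and a depth of 6 give about 0.3584, the value the description quotes for furniture; the depth of 6 is inferred here from that number and is not stated in the source.

```python
import math

def concept_weight(f, d, nt, c=0.125):
    """Weight of a concept word.

    f: selection count, d: hierarchical depth (root = 0),
    nt: total number of feature words, c: constant (0.125 in the example).
    """
    sigmoid = 1.0 / (1.0 + math.exp(-c * d))
    # (sigmoid - 0.5) * 2 rescales the sigmoid to 0 at the root and toward 1 at depth.
    return (f / nt) * 2.0 * (sigmoid - 0.5)
```

The f / nt factor rewards concept words that cover many feature words, while the rescaled sigmoid rewards depth but saturates, so a very deep concept that covers few feature words does not automatically win.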
Finally, the concept words are sorted by weight, and the one with the largest weight is chosen as the conceptual title of the documents (step 140). Because this embodiment searches the hierarchical knowledge structure for a concept word that covers as many of the feature words as possible, and is not restricted to wording that literally appears in the articles, the resulting title summarizes and represents the content of the documents better than the prior art does.
For example, suppose the feature words extracted from a plurality of documents are table, chair, and bed. With WordNet version 1.6 as the hierarchical knowledge structure, the calculation above yields furniture as the concept word with the highest weight (0.3584). Clearly, furniture is a better conceptual title for these documents than table, chair, or bed.
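Steps 120 to 140 can be put together in a short end-to-end sketch. The `PARENT` chain below is a hypothetical single-parent toy hierarchy, arranged so that furniture sits at depth 6; it is not a copy of the actual WordNet 1.6 paths. The word lamp and its placement, the function names, and the alphabetical tie-break are likewise assumptions for illustration.

```python
import math
from collections import Counter

# Hypothetical toy hierarchy (child -> parent); "entity" is the root.
PARENT = {
    "object": "entity",
    "whole": "object",
    "artifact": "whole",
    "instrumentality": "artifact",
    "furnishings": "instrumentality",
    "furniture": "furnishings",
    "lamp": "furnishings",
    "table": "furniture",
    "chair": "furniture",
    "bed": "furniture",
}

def path_to_root(word):
    """Hypernyms of word, from its parent up to the root."""
    path = []
    while word in PARENT:
        word = PARENT[word]
        path.append(word)
    return path

def conceptual_title(feature_words, c=0.125):
    """Return (best concept word, all weights) for the given feature words."""
    counts = Counter()
    for word in feature_words:
        counts.update(set(path_to_root(word)))  # once per feature word
    nt = len(feature_words)
    weights = {}
    for concept, f in counts.items():
        d = len(path_to_root(concept))  # depth: root = 0
        weights[concept] = (f / nt) * 2 * (1 / (1 + math.exp(-c * d)) - 0.5)
    best = max(sorted(weights), key=weights.get)
    return best, weights
```

On this toy data, table, chair, and bed select furniture with a weight of about 0.3584. Replacing bed with lamp shifts the title up one level to furnishings, because furniture then covers only two of the three feature words.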
A preferred embodiment of the present invention takes as its analysis target 612 patent documents retrieved from the website of the United States Patent and Trademark Office (USPTO) using National Science Council (NSC) as the keyword. These patent documents, mostly about two thousand English words long, not only are rich in knowledge content but are also mostly classified under category items of the International Patent Classification (IPC) or the US Patent Classification (UPC). Some of these category items, however, are defined too narrowly or too broadly to describe the document content fully. This embodiment therefore clusters the documents into document clusters and then uses the method of the present invention to establish a suitable conceptual title for each cluster.
All 612 patent documents undergo content analysis, information extraction, section segmentation, and automatic summarization. Structured information unrelated to topic analysis, such as applicant, inventor, and IPC codes, is removed first, and the remainder is segmented according to the patent document format into several main sections (for example, abstract, claims, field of the invention, background of the invention, summary of the invention, and embodiments). Because these sections differ in length, an extractive method selects at most 6 of the best sentences from each section, and the selected sentences are concatenated into a document surrogate that is used for key-term extraction, co-word analysis, indexing, and clustering. Because the claims section consists of legal language, the document surrogates do not include the claims.
In this embodiment, 19343 words that occur at least twice in a document are extracted from the document surrogates. An automatic construction method based on co-word analysis then identifies, among these 19343 words, those that frequently occur together in the same sentence, yielding 2714 words. The complete-link method then groups these 2714 words into 353 small clusters, which are in turn clustered into 101 medium clusters. Because this is still too many clusters, the medium clusters are merged into 33 large clusters and finally into 10 topic clusters. Through these steps, the 612 documents originally collected are grouped into 10 topics, few enough to analyze conveniently. In addition, this embodiment also represents the content of each document by its automatic summary (the document surrogate described above) and clusters the 612 documents directly into 6 clusters with the same multi-stage clustering method.
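One stage of complete-link clustering can be sketched as below. This is a generic, unoptimized agglomerative procedure, not the embodiment's actual implementation: the function name, the similarity callback, and the stopping threshold are all assumptions, and the description does not say how similarity between words or clusters is measured. Running such a stage repeatedly with progressively looser thresholds is one plausible way to obtain the small, medium, large, and topic clusters mentioned above.

```python
def complete_link(items, sim, threshold):
    """Agglomerative clustering with the complete-link criterion.

    The similarity of two clusters is the similarity of their LEAST similar
    pair of members; the most similar pair of clusters is merged until no
    pair reaches the threshold.
    """
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best_sim, best_pair = None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                s = min(sim(a, b) for a in clusters[i] for b in clusters[j])
                if best_sim is None or s > best_sim:
                    best_sim, best_pair = s, (i, j)
        if best_sim < threshold:
            break  # no remaining pair is similar enough to merge
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Because every member of a merged cluster must be similar to every other member, complete link tends to produce compact clusters, which suits grouping words that genuinely co-occur.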
The next step extracts feature words from each of these 10 document clusters. In this embodiment, the words of each cluster are ranked by the product of their frequency of occurrence across the documents of the cluster and their correlation coefficient (TFC × CC), and the top 5 words are taken as the feature words of that cluster.
Fig. 2 and Fig. 3 show examples of conceptual titles extracted from document clusters according to a preferred embodiment of the present invention. Referring to both figures, the second column lists the 5 feature words selected by the product of frequency of occurrence and correlation coefficient when the 612 patent documents are clustered into 6 clusters (Fig. 2) and into 10 clusters (Fig. 3).
Once the feature words of each document cluster have been found, the conceptual title of each cluster can be determined by the method of the present invention. This embodiment uses WordNet as the hierarchical knowledge structure to look up the concept words corresponding to each feature word and calculates the weight of each concept word; the lookup and calculation are the same as in the preceding embodiment and are not repeated here. In Fig. 2 and Fig. 3, the third column lists the three concept words with the highest weights for each cluster, and the concept words in boldface are those judged manually to be the most suitable conceptual title for the cluster.
In Fig. 2, a boldface concept word appears in 3 of the 6 clusters, and in Fig. 3 in 5 of the 10 clusters. Whether the documents are grouped into 6 clusters or into 10 clusters by the different clustering approaches, the conceptual titles produced by the method of the present invention thus reach an accuracy of 50%.
The present invention is compared below with another conceptual title generation tool to verify its effect. InfoMap, a tool proposed at Stanford University for finding the classification of words, takes a list of words as input and outputs several suggested taxonomic class names. The fourth column of Fig. 2 and Fig. 3 lists the top three class names that InfoMap produces when given the feature words of each cluster of this embodiment. As before, the class names in boldface are regarded as correct conceptual titles. According to the comparison in Fig. 2 and Fig. 3, for both the 6 clusters and the 10 clusters, the conceptual titles obtained with the method of the present invention and those obtained with InfoMap both reach an accuracy of 50%.
Fig. 4 shows the topic map obtained after the 612 patent documents are clustered into 6 document clusters. In Fig. 4, each circle represents a specific document topic, the size of the circle indicates the number of patent documents the topic contains, and the number inside the circle is the topic's identification code (ID). The topics are positioned by multidimensional scaling (MDS), so the relative distance between topics in Fig. 4 indicates how closely they are related. The topics contained in each document cluster, together with each topic's identification code, number of documents, and extracted feature words, are detailed in Fig. 5.
The following illustrates how, with the aid of the method of the present invention, the conceptual titles of these 6 document clusters are assigned (Chemistry, Electronics and Semi-conductors, Generality, Communication and computers, Material, and Biomedicine in Fig. 4). Consider document cluster 4 (126 patent documents in total), which consists of the document topics with identification codes 853, 219, and 388 in Fig. 4. As shown in Fig. 5, the feature words extracted from cluster 4 are output, signal, circuit, input, and frequency. From these feature words, the present invention finds the three concept words with the highest weights to be communication, signal, and relation (see the third column of Fig. 2), and on that basis the document manager names cluster 4 "Communication and computers".
In another preferred embodiment of the present invention, 6018 documents related to the financial field are collected from the Reuters collection and used as the test document set, where each document has already been assigned to at least one category by a document classification method. According to these categories, the documents can be divided into 10 clusters (the category names are given in the second column of Fig. 6). Because these documents are short, this embodiment finds the feature words of each cluster by computing the correlation coefficient of each word in the cluster; the third column of Fig. 6 lists the feature words selected from each cluster.
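The correlation-based selection of feature words can be sketched as follows. This passage does not give the formula for the correlation coefficient, so the sketch assumes the phi coefficient of a 2x2 table (word present or absent versus document inside or outside the cluster). The function names, the `top_k` parameter, and the sample documents are hypothetical and serve only as an illustration.

```python
import math
from collections import Counter


def correlation(tp, fp, fn, tn):
    """Phi coefficient of a 2x2 word/cluster table (assumed form)."""
    denom = math.sqrt((tp + fn) * (fp + tn) * (tp + fp) * (fn + tn))
    return (tp * tn - fn * fp) / denom if denom else 0.0


def feature_words(docs, labels, cluster, top_k=3):
    """Rank the words of one cluster by correlation coefficient.

    docs: list of token lists (already segmented); labels: cluster id per doc.
    """
    n_in = sum(1 for lab in labels if lab == cluster)
    n_out = len(docs) - n_in
    df_in, df_out = Counter(), Counter()
    for tokens, lab in zip(docs, labels):
        target = df_in if lab == cluster else df_out
        target.update(set(tokens))  # document frequency, not raw counts
    scores = {}
    for word, tp in df_in.items():
        fp = df_out[word]  # Counter returns 0 for unseen words
        scores[word] = correlation(tp, fp, n_in - tp, n_out - fp)
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Hypothetical, already-segmented documents in two clusters.
docs = [
    ["signal", "circuit", "output"],
    ["signal", "input", "circuit"],
    ["acid", "compound", "output"],
    ["acid", "polymer"],
]
labels = [0, 0, 1, 1]
print(feature_words(docs, labels, cluster=0))
```

A word that appears in every document of the cluster and in none outside it scores 1.0, while a word spread evenly across clusters scores 0, which is why a common word such as "output" ranks last in this toy data.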
Next, the method of generating conceptual titles of the present invention and the hierarchical knowledge structure WordNet are used to find, from these feature words, the three concept words best suited to represent the content of each cluster (see the fourth column of Fig. 6). InfoMap is likewise used to produce, from the feature words of each cluster, the three category names that best represent the cluster's content (see the fifth column of Fig. 6). Words shown in boldface are those judged manually to best represent the cluster's content. As shown in Fig. 6, 7 of the 10 clusters have a title marked in boldface, which means that the conceptual titles obtained with the method of the present invention reach an accuracy of 70%.
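The accuracy figure is the share of clusters for which at least one of the top-ranked candidates was accepted by the human judge. A minimal sketch of that calculation follows; the function name and the per-cluster flags are hypothetical and are not the actual entries of Fig. 6.

```python
def title_accuracy(judgements):
    """Fraction of clusters with at least one accepted candidate title.

    judgements: one list of booleans per cluster, one flag per candidate.
    """
    if not judgements:
        return 0.0
    hits = sum(1 for flags in judgements if any(flags))
    return hits / len(judgements)


# Hypothetical flags: 7 of 10 clusters have an accepted candidate.
example = (
    [[True, False, False]] * 4
    + [[False, True, False]] * 3
    + [[False, False, False]] * 3
)
print(title_accuracy(example))  # 0.7
```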
It is worth noting that, although the same hierarchical knowledge structure WordNet is used to find concept words, the quality of the results differs with the source of the documents. It depends on whether the hierarchical knowledge structure sufficiently covers the feature words extracted from the documents, and on whether the hypernym-hyponym relations between words in the structure reflect the knowledge domain of those documents. In the embodiment above, because WordNet does not cover all the feature words extracted from the patent documents, the conceptual titles produced by the method of the present invention reach an accuracy of only 50%. If the content of the documents matches the vocabulary in WordNet more closely (for example, when Reuters documents are processed), the present invention produces better results. Conversely, finding and adopting a hierarchical knowledge structure that matches the wording of the documents to be processed can likewise improve the performance of the present invention.
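One simple way to gauge whether a hierarchical knowledge structure suits a document set is to measure how many of the extracted feature words it contains. The sketch below is only an illustration of that idea and is not part of the described method; the function name, the vocabulary set, and the sample words are hypothetical.

```python
def coverage(feature_words, hierarchy_vocab):
    """Return (share of feature words found in the hierarchy, missing words)."""
    words = list(feature_words)
    if not words:
        return 0.0, []
    missing = [w for w in words if w.lower() not in hierarchy_vocab]
    return 1 - len(missing) / len(words), missing


# Hypothetical vocabulary of a hierarchy and feature words of one cluster.
vocab = {"signal", "circuit", "input"}
share, missing = coverage(["signal", "circuit", "photoresist", "input"], vocab)
print(share, missing)  # 0.75 ['photoresist']
```

A low share suggests that many feature words cannot contribute any concept word, which is the situation described above for the patent documents.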
In summary, the method of generating conceptual titles of the present invention extracts the feature words of each document cluster, finds in a hierarchical knowledge structure the concept words corresponding to those feature words, and, according to the hierarchical depth and selection count of those concept words, automatically produces a conceptual title that covers the content of the document cluster. This greatly reduces the burden of document analysis and title naming that previously had to rely on manual work.
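The overall procedure can be sketched in a few lines, under stated assumptions. The toy hierarchy below is hypothetical and is not taken from WordNet. Counting each concept word once per feature word and collecting every node on the path to the root follow the claims, while the root depth of 0, the steepness constant, and the product of selection count and rescaled sigmoid are assumptions made only for this illustration.

```python
import math

# Hypothetical toy hierarchy (child -> parent); NOT taken from WordNet.
PARENT = {
    "abstraction": "entity",
    "communication": "abstraction",
    "relation": "abstraction",
    "signal": "communication",
    "frequency": "relation",
    "output": "signal",
    "input": "signal",
}
NODES = set(PARENT) | set(PARENT.values())


def path_to_root(node):
    """Return the node followed by all of its ancestors up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Edges between the root and the node (root depth 0 is an assumption)."""
    return len(path_to_root(node)) - 1


def weight(count, node_depth, steepness=0.125):
    """Assumed weight: selection count times a sigmoid of depth rescaled to [0, 1)."""
    sigmoid = 1.0 / (1.0 + math.exp(-steepness * node_depth))
    return count * 2.0 * (sigmoid - 0.5)


def rank_concepts(feature_words, top_k=3):
    """Return the top_k concept words by weight for one cluster."""
    counts = {}
    for word in feature_words:
        if word not in NODES:
            continue  # feature word not covered by the hierarchy
        # The set guarantees a concept is counted once per feature word.
        for concept in set(path_to_root(word)):
            counts[concept] = counts.get(concept, 0) + 1
    ranked = sorted(counts, key=lambda c: (-weight(counts[c], depth(c)), c))
    return ranked[:top_k]


print(rank_concepts(["output", "signal", "circuit", "input", "frequency"]))
```

The depth factor keeps very general nodes near the root from winning merely because every path passes through them, while the selection count favors concepts shared by many feature words. Because the hierarchy is a toy, the ranking it prints is not expected to match the results reported in the figures.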
Although the present invention has been disclosed above by way of an embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make some changes and refinements without departing from the spirit and scope of the present invention, so the scope of protection of the present invention is defined by the appended claims.

Claims (11)

1. A method of generating conceptual titles, for producing a conceptual title that summarizes the content of a plurality of documents, characterized in that it comprises the following steps:
a. extracting a plurality of feature words from the plurality of documents;
b. finding, in a hierarchical knowledge structure, a plurality of concept words corresponding to the plurality of feature words, and calculating a hierarchical depth and a selection count of each concept word, wherein the selection count is the total number of times each concept word is selected when the plurality of concept words are found using the plurality of feature words;
c. calculating a weight according to the hierarchical depth and the selection count of each concept word; and
d. selecting the concept word having a higher weight as the conceptual title.
2. The method of generating conceptual titles according to claim 1, characterized in that step a comprises:
a1. performing word segmentation and keyword extraction on each of the plurality of documents to obtain a plurality of candidate words;
a2. calculating a correlation coefficient of each of the plurality of candidate words in the plurality of documents; and
a3. selecting, as the plurality of feature words, the candidate words whose correlation coefficient is greater than a specific value or ranks near the top.
3. The method of generating conceptual titles according to claim 2, characterized in that step a3 further comprises:
calculating the frequency of occurrence of each candidate word in the plurality of documents;
multiplying the correlation coefficient of each candidate word by its frequency of occurrence to obtain a product; and
selecting, as the plurality of feature words, the candidate words whose product is greater than the specific value or ranks near the top.
4. The method of generating conceptual titles according to claim 1, characterized in that the hierarchical knowledge structure is a tree structure comprising a root node and a plurality of child nodes, the root node and the plurality of child nodes represent a plurality of synonym words, and the links between the root node and the plurality of child nodes are established by a semantic relation.
5. The method of generating conceptual titles according to claim 4, characterized in that step b comprises:
finding the synonym words corresponding to the plurality of feature words and the child nodes that represent them; and
selecting all the synonym words on the paths from those child nodes to the root node as the plurality of concept words.
6. The method of generating conceptual titles according to claim 5, characterized in that step b further comprises:
calculating the number of times each of the synonym words is selected as the selection count of each concept word.
7. The method of generating conceptual titles according to claim 6, characterized in that, when the selection count of each concept word is calculated, identical concept words are counted only once for each feature word.
8. The method of generating conceptual titles according to claim 4, characterized in that the hierarchical depth is obtained by taking the root node of the hierarchical knowledge structure as the reference and incrementing level by level down to the child node corresponding to the concept word.
9. The method of generating conceptual titles according to claim 4, characterized in that the semantic relation comprises one of a hypernym relation and a broader-term relation.
10. The method of generating conceptual titles according to claim 1, characterized in that the weight comprises a sigmoid (S-shaped) function value proportional to the hierarchical depth.
11. The method of generating conceptual titles according to claim 1, characterized in that the weight is proportional to the selection count.
CN2008101276244A 2008-06-30 2008-06-30 Method of generating conceptual titles Active CN101620611B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008101276244A CN101620611B (en) 2008-06-30 2008-06-30 Method of generating conceptual titles

Publications (2)

Publication Number Publication Date
CN101620611A true CN101620611A (en) 2010-01-06
CN101620611B CN101620611B (en) 2011-06-22

Family

ID=41513850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008101276244A Active CN101620611B (en) 2008-06-30 2008-06-30 Method of generating conceptual titles

Country Status (1)

Country Link
CN (1) CN101620611B (en)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1384454A (en) * 2001-05-01 2002-12-11 株式会社东芝 Information generalizing system and method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106156100A (en) * 2015-04-02 2016-11-23 阿里巴巴集团控股有限公司 A kind of web page title treating method and apparatus
CN106156100B (en) * 2015-04-02 2019-09-03 阿里巴巴集团控股有限公司 A kind of web page title treating method and apparatus
WO2017032084A1 (en) * 2015-08-24 2017-03-02 北京云知声信息技术有限公司 Information output method and apparatus
CN106383817A (en) * 2016-09-29 2017-02-08 北京理工大学 Paper title generation method capable of utilizing distributed semantic information
CN106383817B (en) * 2016-09-29 2019-07-02 北京理工大学 Utilize the Article Titles generation method of distributed semantic information
CN107885718A (en) * 2016-09-30 2018-04-06 腾讯科技(深圳)有限公司 Semanteme determines method and device
CN107885718B (en) * 2016-09-30 2020-01-24 腾讯科技(深圳)有限公司 Semantic determination method and device
CN110210017A (en) * 2019-04-29 2019-09-06 厦门一品威客网络科技股份有限公司 A kind of automatic naming method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN101620611B (en) 2011-06-22

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant