CN100492366C - Method and module for extracting summary - Google Patents


Info

Publication number
CN100492366C
Authority
CN
China
Prior art keywords
weight
current window
content
keyword
window content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CNB200710109499XA
Other languages
Chinese (zh)
Other versions
CN101075260A (en)
Inventor
袁哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CNB200710109499XA priority Critical patent/CN100492366C/en
Publication of CN101075260A publication Critical patent/CN101075260A/en
Application granted granted Critical
Publication of CN100492366C publication Critical patent/CN100492366C/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

A method for extracting a summary comprises: while a window slides through a document in set steps, extracting from the document the current window content corresponding to the current window, calculating the weight of the current window content according to the keywords, and saving the current window content and its corresponding weight; and, when sliding ends, taking one or more window contents with higher weights as the summary, according to the saved window contents and their corresponding weights. An abstract extraction module for implementing the method is also disclosed.

Description

Abstract extraction method and abstract extraction module
Technical field
The present invention relates to the technical field of extracting a summary from a document according to keywords, and in particular to an abstract extraction method and an abstract extraction module.
Background art
Search engines are a common tool on today's internet. Typically, a search engine uses a crawler to fetch data from websites on the network or from a local computer and builds a number of documents from it. When a user runs a search, the search engine matches one or more documents against the keywords the user submits. The abstract extraction module in the search engine then extracts, from each matched document, content related to those keywords as a summary; the summary should contain as much keyword-related information as possible. Finally, the search engine displays the summary of each document to the user on a page.
At present, the abstract extraction module locates the keywords in the document by position matching and then takes the content around them as the summary. For example, when there are three keywords that occur 10, 12, and 18 times in the document, the existing abstract extraction module takes some of these keyword occurrences, together with the content near them, as the summary.
However, when the document contains 4 sentences that each include all of the keywords, those 4 sentences are clearly highly relevant to the keywords, yet the prior art described above cannot detect this and therefore cannot extract part of those 4 sentences as the summary. In other words, a summary extracted by the existing method is only weakly related to the keywords and fails to surface the content in the document that is most relevant to them.
Summary of the invention
In view of this, the present invention proposes an abstract extraction method for extracting the content that matches the keywords as the summary. The present invention also proposes an abstract extraction module.
The present invention provides an abstract extraction method, comprising:
while sliding a window through a document with a set step, extracting from the document the current window content corresponding to the current window, calculating the weight of the current window content according to the keywords, and saving the current window content and its corresponding weight;
after sliding ends, taking one or more window contents with higher weights as the summary, according to the saved window contents and their corresponding weights.
The set step is the smallest unit of the document content.
Before the weight of the current window content is calculated according to the keywords, the method further comprises judging whether the current window content contains a keyword and, if so, calculating the weight of the current window content.
The step of calculating the weight of the current window content according to the keywords comprises: summing the weights of the individual keywords to obtain the weight of the current window content.
The method further comprises: multiplying the weight of each keyword by a coefficient according to the importance of that keyword, wherein the coefficient increases as the importance increases; and/or multiplying the weight of the current window content by, or adding to it, a coefficient, wherein the coefficient is larger the closer the order in which the keywords occur in the current window content is to the order in which they were input, and/or the shorter the distance between the keywords in the current window content.
Before the current window content and its corresponding weight are saved, the method further comprises judging whether the weight of the current window content is greater than the weight of the window contents that overlap the current window and, if so, saving the current window content and its corresponding weight.
The step of taking one or more window contents with higher weights as the summary comprises: sorting the window contents by their corresponding weights; and, according to the summary size, taking one or more window contents in turn in descending order of weight as the summary.
The present invention also provides an abstract extraction module, comprising a storage unit, a sliding unit, a computing unit, and a summary forming unit, wherein:
the storage unit is configured to store the document, the window contents, and their corresponding weights;
the sliding unit is configured to slide a window through the document with a set step and, during sliding, to extract from the document the current window content corresponding to the current window and provide it to the computing unit;
the computing unit is configured to calculate the weight of the current window content according to the keywords and to save the current window content and its corresponding weight in the storage unit;
the summary forming unit is configured to take, after sliding ends, one or more window contents with higher weights from the storage unit as the summary.
The abstract extraction module further comprises: a setting unit, configured to set the window size and the sliding step for the sliding unit; and/or a sorting module, configured to sort the window contents in the storage unit by their corresponding weights, so that the summary generation unit takes one or more window contents with higher weights in turn as the summary.
The computing unit is further configured to judge whether the current window content contains a keyword and, if so, to calculate the weight of the current window content; and/or to judge whether the weight of the current window content is greater than the weight of the window contents that overlap the current window and, if so, to save the current window content and its corresponding weight.
As can be seen from the above scheme, because the present invention calculates the weight of each window content according to the keywords while the window slides, and finally takes one or more window contents with higher weights as the summary, it extracts the content that best reflects the relationship to the keywords and provides the user with the content closest to the keywords the user entered.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the abstract extraction method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the abstract extraction system according to an embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions, and advantages of the present invention clearer, the present invention is described in more detail below by way of embodiments.
Fig. 1 is a schematic flowchart of the abstract extraction method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step 101: preset the window size and the sliding step. Here the window is the basic unit for extracting content from the document and usually contains several words. In general the step is less than or equal to the window size; otherwise the windows cannot cover the full content of the document.
Preferably, the step is set to the smallest unit of content in the document, for example one Chinese character, one English word, or one number.
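The relationship between window size and step can be illustrated with a minimal sketch. It assumes the document has already been split into a list of smallest units (tokens); the function name `slide_windows` and its parameters are hypothetical and do not come from the patent text.

```python
from typing import Iterator, List, Tuple


def slide_windows(tokens: List[str], window_size: int,
                  step: int) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start_index, window_tokens) for each window position."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    start = 0
    while True:
        yield start, tokens[start:start + window_size]
        if start + window_size >= len(tokens):
            break  # the window has reached the end of the document
        start += step


tokens = "a b c d e f g".split()
# Step equal to the window size: every token falls inside some window.
print([w for _, w in slide_windows(tokens, 3, 3)])
# Step larger than the window size: token "d" is never inside a window.
print([w for _, w in slide_windows(tokens, 3, 4)])
```

With a step of one token, as the preferred setting suggests, consecutive windows overlap heavily, which is why a later step compares overlapping windows.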
Step 102: slide the window forward by the set step and extract the content corresponding to the current window. Here the content corresponding to a window is called the window content, and the content corresponding to the current window is called the current window content.
Step 103: judge whether the current window contains a keyword. If it does, execute step 104 and the subsequent steps; otherwise, execute step 107 and the subsequent steps.
Of course, the judgment of step 103 may also be skipped and step 104 and the subsequent steps executed directly, because the weight of a current window that contains no keyword is calculated as zero, so that window will not be used as the summary.
Step 104: calculate the weight of the current window content according to the keywords.
The description here assumes multiple keywords. A single keyword can be regarded as a simplified special case of multiple keywords.
Put simply, the weight of a window content equals the sum of the weights of the individual keywords. For simplicity, two keywords are used as an example. Suppose the user enters the two keywords "summary" and "extraction"; the weight of the window content then equals the weight of "summary" plus the weight of "extraction". The weight of each keyword depends on the number of times that keyword occurs in the current window content: the more occurrences, the larger the weight of the keyword. As the above description shows, the weight of a window content depends on how many keywords occur in it and how many times each occurs. When there is a single keyword, the weight of that keyword, computed as above, is simply used as the weight of the current window content, and the refinements below do not apply.
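As a minimal sketch of this summation, the example below assumes that the weight of a keyword is simply its occurrence count in the window. The text only says that the weight grows with the number of occurrences, so the exact mapping is an assumption, as is the function name `window_weight`.

```python
from collections import Counter
from typing import List


def window_weight(window: List[str], keywords: List[str]) -> int:
    """Sum the per-keyword weights for one window.

    Assumption: a keyword's weight equals its occurrence count.
    """
    counts = Counter(window)
    # Counter returns 0 for missing keys, so a window that contains
    # no keyword gets weight 0 even without a separate check.
    return sum(counts[k] for k in keywords)


keywords = ["summary", "extraction"]
window = "summary extraction picks a summary".split()
print(window_weight(window, keywords))  # 2 for "summary" + 1 for "extraction"
```

Because a window without keywords scores zero, the separate keyword check of step 103 only saves computation; it does not change which windows can be selected.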
Further, because the keywords differ in importance, the weight of each keyword may be multiplied by a coefficient before summation. For example, suppose "summary" occurs 784 times in the document and "extraction" occurs 98 times. A keyword that occurs many times in the document is generally less important than one that occurs only a few times, so before summation the weight of "summary" is multiplied by a smaller coefficient and the weight of "extraction" by a larger one, thereby distinguishing their importance.
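One way to picture such coefficients is sketched below. The text does not give a formula, so the choice of `1 / log(2 + count)` is purely an assumption made for illustration; any function that decreases as the document-wide count grows would fit the description. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def importance_coefficients(document: List[str],
                            keywords: List[str]) -> Dict[str, float]:
    """Give rarer keywords a larger coefficient (hypothetical formula)."""
    doc_counts = Counter(document)
    return {k: 1.0 / math.log(2 + doc_counts[k]) for k in keywords}


def weighted_window_weight(window: List[str], keywords: List[str],
                           coeffs: Dict[str, float]) -> float:
    """Multiply each keyword's in-window count by its coefficient, then sum."""
    counts = Counter(window)
    return sum(coeffs[k] * counts[k] for k in keywords)


# Counts taken from the example in the text: 784 and 98 occurrences.
document = ["summary"] * 784 + ["extraction"] * 98
keywords = ["summary", "extraction"]
coeffs = importance_coefficients(document, keywords)
print(coeffs["summary"] < coeffs["extraction"])  # the rarer keyword counts for more
```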
Further, when the weight of a window content is calculated, it may also be corrected according to the degree of correlation among the keywords in the window content, for example by adding a coefficient and/or multiplying by a coefficient. For instance, when the order in which "summary" and "extraction" occur in the window content is the same as, or closer to, the order in which the user entered them, a larger coefficient is added and/or multiplied in; when the order is different from, or further from, the input order, a smaller coefficient is added and/or multiplied in. Likewise, when the distance between "summary" and "extraction" in the window content is shorter, a larger coefficient is added and/or multiplied in; when the distance is larger, a smaller coefficient is added and/or multiplied in.
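The order and distance corrections can be sketched as a single multiplicative factor. The constants below (a factor of 1.5 for matching order, and `1 + 1/span` for distance) are assumptions chosen only so that the factor behaves as described; the text does not specify values. The sketch uses the first occurrence of each keyword, which is also a simplification.

```python
from typing import List


def relevance_factor(window: List[str], keywords: List[str]) -> float:
    """Return a multiplier that rewards matching order and short distance."""
    # Keywords that actually occur in the window, in input order, deduplicated.
    present = list(dict.fromkeys(k for k in keywords if k in window))
    if len(present) < 2:
        return 1.0  # order and distance are undefined for fewer than two keywords
    positions = [window.index(k) for k in present]  # first occurrence of each
    in_input_order = all(a < b for a, b in zip(positions, positions[1:]))
    order_factor = 1.5 if in_input_order else 1.0   # hypothetical constants
    span = max(positions) - min(positions)          # at least 1 for distinct keywords
    distance_factor = 1.0 + 1.0 / span              # shorter span, larger factor
    return order_factor * distance_factor


keywords = ["summary", "extraction"]
print(relevance_factor("summary extraction from text".split(), keywords))
print(relevance_factor("extraction of a summary".split(), keywords))
```

The window weight would then be multiplied by this factor; an additive variant is equally consistent with the description.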
Step 105: judge whether the weight of the current window content is greater than the weight of the window contents that overlap the current window. If so, execute step 106 and the subsequent steps; otherwise, discard the current window content and then execute step 107 and the subsequent steps.
Step 106: save the current window content and its corresponding weight.
Alternatively, the judgment of step 105 may be skipped and the current window content and its corresponding weight saved directly.
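Steps 105 and 106 can be sketched as a guarded save. Windows are represented here as half-open index ranges `(start, end)`. One detail is an assumption beyond the text: when the new window is kept, the lighter saved windows it overlaps are dropped, so the store never holds overlapping content. The text only states when the current window is saved or discarded.

```python
from typing import List, Tuple

Saved = Tuple[int, int, float]  # (start, end_exclusive, weight)


def overlaps(a_start: int, a_end: int, b_start: int, b_end: int) -> bool:
    """True if the half-open ranges [a_start, a_end) and [b_start, b_end) intersect."""
    return a_start < b_end and b_start < a_end


def save_if_heavier(saved: List[Saved], start: int, end: int,
                    weight: float) -> bool:
    """Save the window only if it outweighs every saved window it overlaps."""
    rivals = [s for s in saved if overlaps(s[0], s[1], start, end)]
    if any(weight <= r[2] for r in rivals):
        return False  # step 105 fails: discard the current window
    # Assumption: lighter overlapping windows are replaced by the new one.
    saved[:] = [s for s in saved if s not in rivals]
    saved.append((start, end, weight))  # step 106
    return True


saved: List[Saved] = []
save_if_heavier(saved, 0, 5, 2.0)   # kept: nothing saved yet
save_if_heavier(saved, 1, 6, 2.0)   # discarded: not greater than the overlapping window
save_if_heavier(saved, 2, 7, 3.0)   # kept: replaces the lighter overlapping window
print(saved)
```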
Step 107: judge whether sliding has finished, that is, whether the end of the document has been reached. If so, execute step 108 and the subsequent steps; if not, return to step 102 and the subsequent steps.
Step 108: sort the saved window contents by weight; without loss of generality, assume they are arranged in descending order of their corresponding weights.
Step 109: according to the required summary length and the ordering from step 108, take one or more window contents in turn from the sorted window contents in descending order of weight, thereby forming the summary.
Alternatively, in the method of this embodiment, the sorting of step 108 may be skipped; in step 109, one or more window contents with higher weights are then taken as the summary according to the saved window contents and their corresponding weights.
The flow ends here.
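Steps 108 and 109 can be sketched as a sort followed by a length-limited selection. The sketch assumes the summary length is measured in characters and that selection stops at the first window that would exceed it; the function name `build_summary` and both of these choices are assumptions, not details given in the text.

```python
from typing import List, Tuple


def build_summary(saved: List[Tuple[float, str]], max_chars: int) -> List[str]:
    """Pick window contents in descending weight order within a length budget."""
    # Step 108: sort by weight, largest first.
    ranked = sorted(saved, key=lambda item: item[0], reverse=True)
    picked: List[str] = []
    used = 0
    # Step 109: take windows in turn until the budget would be exceeded.
    for _weight, text in ranked:
        if used + len(text) > max_chars:
            break
        picked.append(text)
        used += len(text)
    return picked


saved = [(1.0, "low weight"), (3.0, "top window"), (2.0, "mid window")]
print(build_summary(saved, max_chars=20))
```

For the variant that skips the explicit sort, Python's standard `heapq.nlargest` could select the highest-weight entries directly when the number of windows to take is known in advance.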
Fig. 2 is a schematic structural diagram of the abstract extraction module according to an embodiment of the present invention. Referring to Fig. 2, the abstract extraction device comprises a storage unit, a sliding unit, a computing unit, and a summary forming unit.
The storage unit is configured to store the document, the window contents, and their corresponding weights.
The sliding unit is configured to slide a window through the document with a set step and, during sliding, to extract from the document the current window content corresponding to the current window and then provide it to the computing unit.
The computing unit is configured to calculate the weight of the current window content according to the keywords and to save the current window content and its corresponding weight in the storage unit. The computing unit may sum the weights of the individual keywords to obtain the weight of the current window content. Further, the computing unit may multiply the weight of each keyword by a coefficient according to the importance of that keyword, wherein the coefficient increases as the importance increases. In addition, the computing unit may multiply the weight of the current window content by, or add to it, a coefficient, wherein the coefficient is larger the closer the order in which the keywords occur in the current window is to the order in which they were input, and/or the shorter the distance between the keywords in the current window.
In addition, after receiving the current window content and before calculating its weight, the computing unit may further judge whether the current window content contains a keyword and, if so, calculate the weight of the current window content; if the current window content contains no keyword, the computing unit does not calculate its weight and instead receives and processes the next current window content from the sliding unit.
In addition, before saving the current window content and its corresponding weight, the computing unit may further judge whether the weight of the current window content is greater than the weight of the window contents that overlap the current window and, if so, save the current window content and its corresponding weight; if the weight of the current window content is not greater than the weight of the overlapping window contents, the current window content is discarded.
The summary forming unit is configured to take, after sliding ends, one or more window contents with higher weights from the storage unit as the summary.
Still referring to Fig. 2, the abstract extraction module of this embodiment may further comprise a setting unit, which is configured to set the window size and the sliding step for the sliding unit.
The abstract extraction module of this embodiment may further comprise a sorting unit, which is configured to sort the window contents in the storage unit by their corresponding weights, so that the summary generation unit takes one or more window contents with higher weights in turn as the summary.
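The division into units can be pictured with a small object sketch. All class and function names are hypothetical, the weight is assumed to be a plain occurrence count, and the optional coefficient and overlap checks are left out to keep the example short; it is an illustration of the structure, not the patented implementation.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class StorageUnit:
    document: List[str]
    saved: List[Tuple[float, int, List[str]]] = field(default_factory=list)


class SlidingUnit:
    def __init__(self, window_size: int, step: int) -> None:
        # In the text these values come from the optional setting unit.
        self.window_size, self.step = window_size, step

    def windows(self, document: List[str]) -> Iterator[Tuple[int, List[str]]]:
        start = 0
        while True:
            yield start, document[start:start + self.window_size]
            if start + self.window_size >= len(document):
                break  # end of document reached
            start += self.step


class ComputingUnit:
    def __init__(self, keywords: List[str]) -> None:
        self.keywords = keywords

    def process(self, storage: StorageUnit, start: int, window: List[str]) -> None:
        counts = Counter(window)
        weight = sum(counts[k] for k in self.keywords)
        if weight > 0:  # windows without any keyword are not saved
            storage.saved.append((weight, start, window))


class SummaryFormingUnit:
    def form(self, storage: StorageUnit, max_windows: int) -> List[str]:
        # Plays the role of the optional sorting unit as well.
        ranked = sorted(storage.saved, key=lambda s: s[0], reverse=True)
        return [" ".join(w) for _, _, w in ranked[:max_windows]]


def extract_summary(document: List[str], keywords: List[str],
                    window_size: int = 3, step: int = 1,
                    max_windows: int = 1) -> List[str]:
    storage = StorageUnit(document)
    sliding = SlidingUnit(window_size, step)
    computing = ComputingUnit(keywords)
    for start, window in sliding.windows(storage.document):
        computing.process(storage, start, window)
    return SummaryFormingUnit().form(storage, max_windows)


doc = "the module performs summary extraction for a search engine summary page".split()
print(extract_summary(doc, ["summary", "extraction"]))
```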
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. An abstract extraction method, characterized in that the method comprises:
while sliding a window through a document with a set step, extracting from the document the current window content corresponding to the current window, calculating the weight of the current window content according to the keywords, and saving the current window content and its corresponding weight;
after sliding ends, taking one or more window contents with higher weights as the summary, according to the saved window contents and their corresponding weights.
2. The method according to claim 1, characterized in that the set step is the smallest unit of the document content.
3. The method according to claim 1, characterized in that, before the weight of the current window content is calculated according to the keywords, the method further comprises judging whether the current window content contains a keyword and, if so, calculating the weight of the current window content.
4. The method according to claim 1, characterized in that the step of calculating the weight of the current window content according to the keywords comprises: summing the weights of the individual keywords to obtain the weight of the current window content.
5. The method according to claim 4, characterized in that the method further comprises:
multiplying the weight of each keyword by a coefficient according to the importance of that keyword, wherein the coefficient increases as the importance increases; and/or
multiplying the weight of the current window content by, or adding to it, a coefficient, wherein the coefficient is larger the closer the order in which the keywords occur in the current window content is to the order in which they were input, and/or the shorter the distance between the keywords in the current window content.
6. The method according to claim 1, characterized in that, before the current window content and its corresponding weight are saved, the method further comprises judging whether the weight of the current window content is greater than the weight of the window contents that overlap the current window and, if so, saving the current window content and its corresponding weight.
7. The method according to claim 1, characterized in that the step of taking one or more window contents with higher weights as the summary comprises:
sorting the window contents by their corresponding weights;
according to the summary size, taking one or more window contents in turn in descending order of weight as the summary.
8. An abstract extraction module, characterized in that the abstract extraction module comprises a storage unit, a sliding unit, a computing unit, and a summary forming unit, wherein:
the storage unit is configured to store the document, the window contents, and their corresponding weights;
the sliding unit is configured to slide a window through the document with a set step and, during sliding, to extract from the document the current window content corresponding to the current window and provide it to the computing unit;
the computing unit is configured to calculate the weight of the current window content according to the keywords and to save the current window content and its corresponding weight in the storage unit;
the summary forming unit is configured to take, after sliding ends, one or more window contents with higher weights from the storage unit as the summary.
9. The abstract extraction module according to claim 8, characterized in that the abstract extraction module further comprises:
a setting unit, configured to set the window size and the sliding step for the sliding unit; and/or
a sorting module, configured to sort the window contents in the storage unit by their corresponding weights, so that the summary generation unit takes one or more window contents with higher weights in turn as the summary.
10. The abstract extraction module according to claim 8, characterized in that the computing unit is further configured to judge whether the current window content contains a keyword and, if so, to calculate the weight of the current window content; and/or
is further configured to judge whether the weight of the current window content is greater than the weight of the window contents that overlap the current window and, if so, to save the current window content and its corresponding weight.
CNB200710109499XA 2007-06-28 2007-06-28 Method and module for extracting summary Active CN100492366C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200710109499XA CN100492366C (en) 2007-06-28 2007-06-28 Method and module for extracting summary

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB200710109499XA CN100492366C (en) 2007-06-28 2007-06-28 Method and module for extracting summary

Publications (2)

Publication Number Publication Date
CN101075260A CN101075260A (en) 2007-11-21
CN100492366C true CN100492366C (en) 2009-05-27

Family

ID=38976311

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200710109499XA Active CN100492366C (en) 2007-06-28 2007-06-28 Method and module for extracting summary

Country Status (1)

Country Link
CN (1) CN100492366C (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102314448B (en) * 2010-07-06 2013-12-04 株式会社理光 Equipment for acquiring one or more key elements from document and method
CN104091058A (en) * 2014-06-27 2014-10-08 北京君和信达科技有限公司 Safety inspection conclusion submitting method and device
CN105808570A (en) * 2014-12-30 2016-07-27 北京奇虎科技有限公司 Method and device for providing abstract searching service
CN105808566A (en) * 2014-12-30 2016-07-27 北京奇虎科技有限公司 Method and device for extracting abstracts from webpages on basis of search words
CN105808552A (en) * 2014-12-30 2016-07-27 北京奇虎科技有限公司 Method and device for extracting abstract from webpage based on slide window
CN107451302B (en) * 2017-09-22 2020-08-28 深圳大学 Modeling method and system based on position top-k keyword query under sliding window
CN108628833B (en) * 2018-05-11 2021-01-22 北京三快在线科技有限公司 Method and device for determining summary of original content and method and device for recommending original content
CN109522402A (en) * 2018-10-22 2019-03-26 国家电网有限公司 A kind of abstract extraction method and storage medium based on power industry characteristic key words

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
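Four main methods of automatic abstracting (自动文摘的四种主要方法). Liu Ting (刘挺), Wang Kaizhu (王开铸). Journal of the China Society for Scientific and Technical Information (情报学报), Vol. 18, No. 1. 1999 *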

Also Published As

Publication number Publication date
CN101075260A (en) 2007-11-21

Similar Documents

Publication Publication Date Title
CN100492366C (en) Method and module for extracting summary
CN105955976B (en) A kind of automatic answering system and method
US9106698B2 (en) Method and server for intelligent categorization of bookmarks
CN101334768B (en) Method and system for eliminating ambiguity for word meaning by computer, and search method
CN109582704B (en) Recruitment information and the matched method of job seeker resume
US8914363B2 (en) Disambiguating tags in network based multiple user tagging systems
CN102725759A (en) Semantic table of contents for search results
US20130173610A1 (en) Extracting Search-Focused Key N-Grams and/or Phrases for Relevance Rankings in Searches
CN106776567B (en) Internet big data analysis and extraction method and system
CN101458708B (en) Searching result clustering method and device
CN102855252B (en) A kind of need-based data retrieval method and device
WO2006108069A2 (en) Searching through content which is accessible through web-based forms
CN102138142A (en) Dictionary suggestions for partial user entries
CN105630940B (en) A kind of information retrieval method based on readable index
CN106484797A (en) Accident summary abstracting method based on sparse study
US20040158558A1 (en) Information processor and program for implementing information processor
US7181688B1 (en) Device and method for retrieving documents
JP5718405B2 (en) Utterance selection apparatus, method and program, dialogue apparatus and method
US8799268B2 (en) Consolidating tags
CN108345694B (en) Document retrieval method and system based on theme database
CN110008312A (en) A kind of document writing assistant implementation method, system and electronic equipment
CN114443847A (en) Text classification method, text processing method, text classification device, text processing device, computer equipment and storage medium
CN110059253A (en) A kind of sort method and system and equipment based on natural language analysis
CN102122296B (en) Search result clustering method and device
CN107315735B (en) Method and equipment for note arrangement

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant