CN100492366C - Method and module for extracting summary - Google Patents
- Publication number
- CN100492366C
- Authority
- CN
- China
- Prior art keywords
- weight
- current window
- content
- keyword
- window content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Abstract
A method for extracting a summary: while a window slides through a document at a set step, the content corresponding to the current window is extracted from the document, the weight of the current window content is calculated according to the keywords, and the current window content and its weight are stored. When sliding ends, one or more window contents with the higher weights are taken from the stored window contents and weights as the summary. A summary extraction module for implementing the method is also disclosed.
Description
Technical field
The present invention relates to the technical field of extracting a summary from a document according to keywords, and in particular to a summary extraction method and a summary extraction module.
Background art
A search engine is a common tool on today's Internet. Typically, a search engine uses a crawler to obtain data from websites on the network or from a local computer and forms a plurality of documents. When a user searches, the search engine matches one or more documents according to the keywords submitted by the user. The summary extraction module in the search engine then extracts, from each matched document, content related to the keywords as a summary; the summary should contain as much keyword-related information as possible. The search engine then displays the summary of each document to the user on a page.
At present, the summary extraction module locates the keywords in the document by position matching and then takes the content surrounding the keywords as the summary. For example, if there are three keywords that occur 10, 12 and 18 times in the document, the existing summary extraction module takes some of these keyword occurrences, together with the nearby content, as the summary.
However, when the document contains 4 sentences that include all of the keywords, these 4 sentences are highly relevant to the keywords; the prior art described above cannot detect this situation and therefore cannot extract part of these 4 sentences as the summary. In other words, the summary extracted by the existing method has a low degree of association with the keywords and fails to present the content in the document that is highly relevant to them.
Summary of the invention
In view of this, the present invention proposes a summary extraction method for extracting the content that best matches the keywords as the summary. The present invention also proposes a summary extraction module.
The present invention provides a summary extraction method, comprising:
while sliding a window through a document at a set step, extracting from the document the current window content corresponding to the current window, calculating the weight of the current window content according to the keywords, and saving the current window content and its corresponding weight;
after sliding ends, taking out, according to the saved window contents and corresponding weights, one or more window contents with the higher weights as the summary.
The set step is the smallest unit of the document content.
Before calculating the weight of the current window content according to the keywords, the method further comprises determining whether the current window content contains a keyword and, if so, calculating the weight of the current window content.
The step of calculating the weight of the current window content according to the keywords comprises: summing the weights of the keywords to obtain the weight of the current window content.
The method further comprises: multiplying the weight of each keyword by a coefficient according to the importance of that keyword, wherein the coefficient increases as the importance increases; and/or multiplying the weight of the current window content by, or adding to it, a coefficient, wherein the coefficient is larger the closer the order in which the keywords occur in the current window content is to the order of the input keywords, and/or the shorter the distance between the keywords in the current window content.
Before saving the current window content and its corresponding weight, the method further comprises determining whether the weight of the current window content is greater than the weight of a window content overlapping the current window and, if so, saving the current window content and its corresponding weight.
The step of taking out one or more window contents with the higher weights as the summary comprises: sorting the window contents according to their weights; and, according to the summary size, taking out one or more window contents in descending order of weight as the summary.
The present invention also provides a summary extraction module, comprising a storage unit, a sliding unit, a computing unit and a summary forming unit, wherein:
the storage unit is configured to store the document, the window contents and the corresponding weights;
the sliding unit is configured to slide a window through the document at a set step and, during sliding, to extract from the document the current window content corresponding to the current window and provide it to the computing unit;
the computing unit is configured to calculate the weight of the current window content according to the keywords and to save the current window content and its corresponding weight in the storage unit;
the summary forming unit is configured to take out from the storage unit, after sliding ends, one or more window contents with the higher weights as the summary.
The summary extraction module further comprises: a setting unit, configured to set the window size and the sliding step for the sliding unit; and/or a sorting module, configured to sort the window contents in the storage unit according to their corresponding weights, so that the summary forming unit takes out one or more window contents with the higher weights in order as the summary.
The computing unit is further configured to determine whether the current window content contains a keyword and, if so, to calculate the weight of the current window content; and/or to determine whether the weight of the current window content is greater than the weight of a window content overlapping the current window and, if so, to save the current window content and its corresponding weight.
As can be seen from the above scheme, because the present invention calculates the weight of the window content according to the keywords while the window slides and finally takes out one or more window contents with the higher weights as the summary, it can extract the summary content that best reflects the relationship with the keywords and provide the user with the content closest to the keywords the user entered.
Description of drawings
Fig. 1 is a schematic flowchart of the summary extraction method according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the summary extraction module according to an embodiment of the present invention.
Embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below by way of embodiments.
Fig. 1 is a schematic flowchart of the summary extraction method according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step 101: preset the window size and the sliding step. The window here is the basic unit in which content is extracted from the document and generally contains a plurality of words. In general the step is less than or equal to the window size; otherwise the windows cannot cover the full content of the document.
Preferably, the step is set to the smallest unit of content in the document, for example one Chinese character, one English word or one number.
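The window and step settings of step 101 can be illustrated with a minimal Python sketch. It assumes the document has already been split into its smallest units (a list of tokens); the function name `slide_windows` and the extra tail window emitted when the last step stops short of the document end are illustrative assumptions, not details given in the text.

```python
from typing import Iterator, List, Tuple


def slide_windows(tokens: List[str], window_size: int,
                  step: int = 1) -> Iterator[Tuple[int, List[str]]]:
    """Yield (start_index, window_tokens) for each window position."""
    if window_size <= 0 or step <= 0:
        raise ValueError("window_size and step must be positive")
    if step > window_size:
        # A step larger than the window would leave content uncovered.
        raise ValueError("step must not exceed window_size")
    if not tokens:
        return
    if len(tokens) <= window_size:
        # The whole document fits into a single window.
        yield 0, list(tokens)
        return
    start = 0
    while start + window_size <= len(tokens):
        yield start, tokens[start:start + window_size]
        start += step
    # Assumption: add one final window so the document tail is covered
    # when the last regular window stopped before the end.
    last_start = start - step
    if last_start + window_size < len(tokens):
        tail_start = len(tokens) - window_size
        yield tail_start, tokens[tail_start:]
```

With a step of one token, every position in the document starts a window, which matches the preferred setting described above at the cost of more windows to score.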
Of course, the judgment of step 103 may also be omitted, and step 104 and its subsequent steps executed directly, because when the current window contains no keyword its weight is calculated as zero and the window therefore cannot be used as the summary.
The description here assumes a plurality of keywords. A single keyword can be regarded as a simplified special case of a plurality of keywords.
In simple terms, the weight of the window content equals the sum of the weights of the keywords. For simplicity, two keywords are used as an example. Suppose the user enters the two words "summary extraction"; the weight of the window content then equals the weight of "summary" plus the weight of "extraction". The weight of each keyword depends on the number of times that keyword occurs in the current window content: the more occurrences, the greater the weight. As can be seen from the above, the weight of the window content depends on how many of the keywords occur in it and on how many times each one occurs. When there is a single keyword, the weight of that keyword, obtained as described above, is simply used as the weight of the current window content, and the following does not apply.
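As a minimal sketch of this summation, the following Python code takes the weight of each keyword to be its raw occurrence count in the window. The text only requires that the weight grow with the count, so using the count itself is an assumption, as are the function names.

```python
from typing import Dict, Sequence


def keyword_counts(window: Sequence[str],
                   keywords: Sequence[str]) -> Dict[str, int]:
    """Count how often each keyword occurs in the window."""
    return {kw: sum(1 for tok in window if tok == kw) for kw in keywords}


def window_weight(window: Sequence[str], keywords: Sequence[str]) -> float:
    """Weight of the window: sum of the per-keyword weights.

    Assumption: the weight of a keyword is its occurrence count.
    """
    return float(sum(keyword_counts(window, keywords).values()))
```

A window that contains no keyword receives a weight of zero under this scheme, which is why the explicit keyword check of step 103 can be skipped.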
Further, because the keywords differ in importance, the weight of each keyword may be multiplied by a coefficient before summation. For example, if "summary" occurs 784 times in the document and "extraction" occurs 98 times, then, since a keyword that occurs many times in a document is generally less important than one that occurs few times, the weight of "summary" is multiplied by a smaller coefficient and the weight of "extraction" by a larger coefficient before summation, thereby distinguishing their importance.
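One possible way to realize such coefficients is sketched below in Python. The inverse-logarithm formula is an assumption chosen only because it shrinks as the document-wide count grows; the text does not specify a formula, and the names `importance_coefficient` and `weighted_window_weight` are hypothetical.

```python
import math
from typing import Dict, Sequence


def importance_coefficient(doc_count: int) -> float:
    """Hypothetical coefficient: smaller for keywords frequent in the document."""
    # log(2 + n) is positive for every n >= 0, so the coefficient is finite.
    return 1.0 / math.log(2 + doc_count)


def weighted_window_weight(window: Sequence[str],
                           keywords: Sequence[str],
                           doc_counts: Dict[str, int]) -> float:
    """Sum of per-keyword window counts, each scaled by its coefficient."""
    total = 0.0
    for kw in keywords:
        occurrences = sum(1 for tok in window if tok == kw)
        total += importance_coefficient(doc_counts.get(kw, 0)) * occurrences
    return total
```

With the counts from the example, `importance_coefficient(98)` is larger than `importance_coefficient(784)`, so an occurrence of "extraction" contributes more to the window weight than an occurrence of "summary".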
Further, when calculating the weight of the window content, the weight may also be corrected according to the degree of correlation among the keywords in the window content, for example by adding a coefficient and/or multiplying by a coefficient. For instance: when the order in which "summary" and "extraction" occur in the window content is the same as, or closer to, the order entered by the user, a larger coefficient is added and/or multiplied; when the order differs from, or is further from, the order entered by the user, a smaller coefficient is added and/or multiplied; when the distance between "summary" and "extraction" in the window content is shorter, a larger coefficient is added and/or multiplied; when the distance is larger, a smaller coefficient is added and/or multiplied.
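The order and distance corrections can be sketched as a single multiplicative factor. The Python code below is only one possible reading: it looks at the first occurrence of each keyword, grants a fixed bonus when those occurrences follow the input order, and uses the reciprocal of their spread as a proximity term. The formula, the default `order_bonus` value and the function name are all assumptions.

```python
from typing import Sequence


def correction_factor(window: Sequence[str], keywords: Sequence[str],
                      order_bonus: float = 0.5) -> float:
    """Hypothetical multiplicative correction for keyword order and proximity."""
    # Position of the first occurrence of each keyword present in the window,
    # listed in the order the user entered the keywords.
    first_pos = [list(window).index(kw) for kw in keywords if kw in window]
    if len(first_pos) < 2:
        return 1.0  # nothing to compare with fewer than two keywords
    in_order = all(a < b for a, b in zip(first_pos, first_pos[1:]))
    spread = max(first_pos) - min(first_pos)
    proximity = 1.0 / max(spread, 1)  # closer keywords give a larger term
    return (1.0 + (order_bonus if in_order else 0.0)) * (1.0 + proximity)
```

The factor would be multiplied into the window weight; an additive variant, which the text also allows, could be built from the same two terms.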
In addition, the judgment of step 105 may also be omitted, and the current window content and its corresponding weight saved directly.
Step 107: determine whether sliding has ended, that is, whether the end of the document has been reached. If so, execute step 108 and its subsequent steps; if not, execute step 102 and its subsequent steps.
In addition, in the method of this embodiment, the sorting of step 108 may also be omitted; instead, in step 109, one or more window contents with the higher weights are taken out as the summary according to the saved window contents and corresponding weights.
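The final selection can be sketched in Python as follows. The sketch assumes the saved windows are available as (weight, content) pairs and that the summary size is a character budget; both the data layout and the name `select_summary` are illustrative assumptions.

```python
from typing import List, Tuple


def select_summary(windows: List[Tuple[float, str]],
                   max_chars: int) -> List[str]:
    """Take window contents in descending weight order until the budget is used."""
    chosen: List[str] = []
    used = 0
    # sorted() is stable, so windows of equal weight keep their saved order.
    for _weight, content in sorted(windows, key=lambda w: w[0], reverse=True):
        if used + len(content) > max_chars:
            break
        chosen.append(content)
        used += len(content)
    return chosen
```

For example, `select_summary([(1.0, "aaa"), (3.0, "bbb"), (2.0, "cc")], 5)` returns the two heaviest windows that fit into five characters.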
The flow ends here.
Fig. 2 is a schematic structural diagram of the summary extraction module according to an embodiment of the present invention. Referring to Fig. 2, the summary extraction module comprises a storage unit, a sliding unit, a computing unit and a summary forming unit.
The storage unit is configured to store the document, the window contents and the corresponding weights.
The sliding unit is configured to slide a window through the document at a set step and, during sliding, to extract from the document the current window content corresponding to the current window and then provide the current window content to the computing unit.
The computing unit is configured to calculate the weight of the current window content according to the keywords and to save the current window content and its corresponding weight in the storage unit. The computing unit may sum the weights of the keywords to obtain the weight of the current window content. Further, the computing unit may multiply the weight of each keyword by a coefficient according to the importance of that keyword, wherein the coefficient increases as the importance increases. In addition, the computing unit may multiply the weight of the current window content by, or add to it, a coefficient that is larger the closer the order in which the keywords occur in the current window is to the order of the input keywords, and/or the shorter the distance between the keywords in the current window.
In addition, after receiving the current window content and before calculating its weight, the computing unit may further determine whether the current window content contains a keyword and, if so, calculate the weight of the current window content; if the current window content contains no keyword, the computing unit does not calculate its weight and instead receives and processes the next current window content from the sliding unit.
In addition, before saving the current window content and its corresponding weight, the computing unit may further determine whether the weight of the current window content is greater than the weight of a window content overlapping the current window and, if so, save the current window content and its corresponding weight; if the weight of the current window content is not greater than the weight of the overlapping window content, the current window content is discarded.
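The overlap check can be sketched in Python as below. The text states only that a current window is saved when it outweighs the overlapping saved window content and discarded otherwise; removing the lighter overlapping entries from storage is an added assumption, and `SavedWindow` and `save_if_heavier` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SavedWindow:
    start: int     # index of the first unit in the window
    end: int       # index one past the last unit (exclusive)
    weight: float


def save_if_heavier(saved: List[SavedWindow], cand: SavedWindow) -> bool:
    """Save cand only if it outweighs every saved window it overlaps."""
    overlapping = [w for w in saved
                   if w.start < cand.end and cand.start < w.end]
    if any(cand.weight <= w.weight for w in overlapping):
        return False  # not greater than an overlapping window: discard
    # Assumption: the lighter overlapping windows are dropped from storage.
    saved[:] = [w for w in saved if w not in overlapping]
    saved.append(cand)
    return True
```

Keeping only the heaviest window among overlapping ones avoids filling the summary with near-duplicate passages that differ by a single step.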
The summary forming unit is configured to take out from the storage unit, after sliding ends, one or more window contents with the higher weights as the summary.
Still referring to Fig. 2, the summary extraction module of this embodiment may further comprise a setting unit, configured to set the window size and the sliding step for the sliding unit.
The summary extraction module of this embodiment may further comprise a sorting unit, configured to sort the window contents in the storage unit according to their corresponding weights, so that the summary forming unit takes out one or more window contents with the higher weights in order as the summary.
The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A summary extraction method, characterized in that the method comprises:
while sliding a window through a document at a set step, extracting from the document the current window content corresponding to the current window, calculating the weight of the current window content according to the keywords, and saving the current window content and its corresponding weight;
after sliding ends, taking out, according to the saved window contents and corresponding weights, one or more window contents with the higher weights as the summary.
2. The method according to claim 1, characterized in that the set step is the smallest unit of the document content.
3. The method according to claim 1, characterized in that, before calculating the weight of the current window content according to the keywords, the method further comprises determining whether the current window content contains a keyword and, if so, calculating the weight of the current window content.
4. The method according to claim 1, characterized in that the step of calculating the weight of the current window content according to the keywords comprises: summing the weights of the keywords to obtain the weight of the current window content.
5. The method according to claim 4, characterized in that the method further comprises:
multiplying the weight of each keyword by a coefficient according to the importance of that keyword, wherein the coefficient increases as the importance increases; and/or
multiplying the weight of the current window content by, or adding to it, a coefficient, wherein the coefficient is larger the closer the order in which the keywords occur in the current window content is to the order of the input keywords, and/or the shorter the distance between the keywords in the current window content.
6. The method according to claim 1, characterized in that, before saving the current window content and its corresponding weight, the method further comprises determining whether the weight of the current window content is greater than the weight of a window content overlapping the current window and, if so, saving the current window content and its corresponding weight.
7. The method according to claim 1, characterized in that the step of taking out one or more window contents with the higher weights as the summary comprises:
sorting the window contents according to their corresponding weights;
according to the summary size, taking out one or more window contents in descending order of weight as the summary.
8. A summary extraction module, characterized in that the summary extraction module comprises a storage unit, a sliding unit, a computing unit and a summary forming unit, wherein:
the storage unit is configured to store the document, the window contents and the corresponding weights;
the sliding unit is configured to slide a window through the document at a set step and, during sliding, to extract from the document the current window content corresponding to the current window and provide it to the computing unit;
the computing unit is configured to calculate the weight of the current window content according to the keywords and to save the current window content and its corresponding weight in the storage unit;
the summary forming unit is configured to take out from the storage unit, after sliding ends, one or more window contents with the higher weights as the summary.
9. The summary extraction module according to claim 8, characterized in that the summary extraction module further comprises:
a setting unit, configured to set the window size and the sliding step for the sliding unit; and/or
a sorting module, configured to sort the window contents in the storage unit according to their corresponding weights, so that the summary forming unit takes out one or more window contents with the higher weights in order as the summary.
10. The summary extraction module according to claim 8, characterized in that the computing unit is further configured to determine whether the current window content contains a keyword and, if so, to calculate the weight of the current window content; and/or
is further configured to determine whether the weight of the current window content is greater than the weight of a window content overlapping the current window and, if so, to save the current window content and its corresponding weight.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB200710109499XA CN100492366C (en) | 2007-06-28 | 2007-06-28 | Method and module for extracting summary |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB200710109499XA CN100492366C (en) | 2007-06-28 | 2007-06-28 | Method and module for extracting summary |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101075260A (en) | 2007-11-21 |
CN100492366C (en) | 2009-05-27 |
Family
ID=38976311
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB200710109499XA Active CN100492366C (en) | 2007-06-28 | 2007-06-28 | Method and module for extracting summary |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100492366C (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102314448B (en) * | 2010-07-06 | 2013-12-04 | 株式会社理光 | Equipment for acquiring one or more key elements from document and method |
CN104091058A (en) * | 2014-06-27 | 2014-10-08 | 北京君和信达科技有限公司 | Safety inspection conclusion submitting method and device |
CN105808570A (en) * | 2014-12-30 | 2016-07-27 | 北京奇虎科技有限公司 | Method and device for providing abstract searching service |
CN105808566A (en) * | 2014-12-30 | 2016-07-27 | 北京奇虎科技有限公司 | Method and device for extracting abstracts from webpages on basis of search words |
CN105808552A (en) * | 2014-12-30 | 2016-07-27 | 北京奇虎科技有限公司 | Method and device for extracting abstract from webpage based on slide window |
CN107451302B (en) * | 2017-09-22 | 2020-08-28 | 深圳大学 | Modeling method and system based on position top-k keyword query under sliding window |
CN108628833B (en) * | 2018-05-11 | 2021-01-22 | 北京三快在线科技有限公司 | Method and device for determining summary of original content and method and device for recommending original content |
CN109522402A (en) * | 2018-10-22 | 2019-03-26 | 国家电网有限公司 | A kind of abstract extraction method and storage medium based on power industry characteristic key words |
2007
- 2007-06-28 CN CNB200710109499XA patent/CN100492366C/en active Active
Non-Patent Citations (2)
Title |
---|
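Four main methods of automatic abstracting. Liu Ting, Wang Kaizhu. Journal of the China Society for Scientific and Technical Information, Vol. 18, No. 1. 1999 |
Four main methods of automatic abstracting. Liu Ting, Wang Kaizhu. Journal of the China Society for Scientific and Technical Information, Vol. 18, No. 1. 1999 * |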
Also Published As
Publication number | Publication date |
---|---|
CN101075260A (en) | 2007-11-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN100492366C (en) | Method and module for extracting summary | |
CN105955976B (en) | A kind of automatic answering system and method | |
US9106698B2 (en) | Method and server for intelligent categorization of bookmarks | |
CN101334768B (en) | Method and system for eliminating ambiguity for word meaning by computer, and search method | |
CN109582704B (en) | Recruitment information and the matched method of job seeker resume | |
US8914363B2 (en) | Disambiguating tags in network based multiple user tagging systems | |
CN102725759A (en) | Semantic table of contents for search results | |
US20130173610A1 (en) | Extracting Search-Focused Key N-Grams and/or Phrases for Relevance Rankings in Searches | |
CN106776567B (en) | Internet big data analysis and extraction method and system | |
CN101458708B (en) | Searching result clustering method and device | |
CN102855252B (en) | A kind of need-based data retrieval method and device | |
WO2006108069A2 (en) | Searching through content which is accessible through web-based forms | |
CN102138142A (en) | Dictionary suggestions for partial user entries | |
CN105630940B (en) | A kind of information retrieval method based on readable index | |
CN106484797A (en) | Accident summary abstracting method based on sparse study | |
US20040158558A1 (en) | Information processor and program for implementing information processor | |
US7181688B1 (en) | Device and method for retrieving documents | |
JP5718405B2 (en) | Utterance selection apparatus, method and program, dialogue apparatus and method | |
US8799268B2 (en) | Consolidating tags | |
CN108345694B (en) | Document retrieval method and system based on theme database | |
CN110008312A (en) | A kind of document writing assistant implementation method, system and electronic equipment | |
CN114443847A (en) | Text classification method, text processing method, text classification device, text processing device, computer equipment and storage medium | |
CN110059253A (en) | A kind of sort method and system and equipment based on natural language analysis | |
CN102122296B (en) | Search result clustering method and device | |
CN107315735B (en) | Method and equipment for note arrangement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |