CN101751425A - Method for acquiring document set abstracts and device - Google Patents


Info

Publication number
CN101751425A
CN101751425A · CN200810239344A
Authority
CN
China
Prior art keywords
sentence
weights
importance value
document
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200810239344A
Other languages
Chinese (zh)
Inventor
万小军
杨建武
肖建国
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Peking University
Peking University Founder Group Co Ltd
Original Assignee
BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Peking University
Peking University Founder Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd, Peking University, Peking University Founder Group Co Ltd filed Critical BEIJING FOUNDER E-GOVERNMENT INFORMATION TECHNOLOGY Co Ltd
Priority to CN200810239344A priority Critical patent/CN101751425A/en
Publication of CN101751425A publication Critical patent/CN101751425A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for acquiring document set abstracts that improve the quality of the acquired abstracts. The method extracts every sentence contained in each document of a document set to form a sentence set, determines an importance weight for each sentence in the sentence set based on the text similarity between the documents in the document set and the sentences in the sentence set, and forms the document set abstract by selecting a specified number of sentences in descending order of their determined importance weights.

Description

Method and device for acquiring document set abstracts
Technical field
The present invention relates to the field of spoken and written language processing and to the technical field of information retrieval, and in particular to a method and device for acquiring document set abstracts.
Background technology
With the rapid spread and application of Internet technology, techniques for acquiring document set abstracts have been widely used in text and web content retrieval. Document set abstract acquisition refers to a computer system automatically obtaining, from a document set containing many documents, information that reflects the main points of the documents' content. The technique gives users a brief and concise description of a document set's content and makes it convenient to survey large volumes of documents. For example, the news service provided by an Internet portal typically works by first collecting news items from the network, sorting them by topic and document type into multiple document sets, and then using such an abstract acquisition technique to obtain a summary of each document set, so that users can quickly and conveniently browse the news they are interested in.
Existing methods for acquiring document set abstracts fall mainly into two classes: methods based on sentence extraction and methods based on sentence generation (abstraction). In a sentence-extraction method, every document in the document set is split into sentences, the importance weight of each sentence is determined according to predetermined sentence-weighting indicators, for example sentence position, word clusters, topic signatures, or term frequency/inverse document frequency (TF/IDF), and at least one sentence with the largest importance weight is selected to form the abstract of the document set. In a sentence-generation method, natural language understanding techniques are used to perform syntactic and semantic analysis of each sentence in the document set, and information extraction or natural language generation techniques are used to produce new sentences, thereby obtaining the abstract of the document set. As this description shows, the abstract obtained by a sentence-extraction method is composed of sentences that already exist in the documents of the document set, so the content of the document set need not be analyzed with complex deep natural language understanding techniques; sentence-extraction methods are therefore simpler to implement than sentence-generation methods.
When determining the importance weight of each sentence in the document set, existing sentence-extraction methods can, besides the predetermined sentence-weighting indicators introduced above, also use graph-model-based methods. For example, the article "Summarizing Similarities and Differences Among Related Documents" (by I. Mani and E. Bloedorn, published in the journal Information Retrieval in 2000) discloses a method named WebSumm. The WebSumm method uses a link graph model in which each sentence of the document set is represented by a vertex, and assumes that a vertex connected to more other vertices represents a more important sentence; the importance weights of the sentences in the document set are determined on this basis, and the abstract of the document set is obtained accordingly.
In the method for the weights of importance value of determining each sentence in the document sets based on graph model of above-mentioned introduction, only considered the relation between the sentence in the document sets, do not consider of the influence of the relation of sentence and document to the importance of sentence, suppose that promptly the importance of all documents all equates in the document sets, yet the importance of different document is different in the document sets usually, the difference of importance that existing method for acquiring document set abstracts based on graph model can not reflect different document in the document sets is to obtaining document set abstracts result's influence, thus document set abstracts obtain poor effect.
Summary of the invention
The embodiments of the invention provide a method and device for acquiring document set abstracts, in order to solve the problem that existing graph-model-based abstract acquisition produces poor abstracts.
The technical scheme that the embodiment of the invention provides is as follows:
A method for acquiring document set abstracts comprises:
extracting each sentence contained in each document of the document set to form a sentence set;
determining the importance weight of each sentence in the sentence set based on the text similarity between the documents in the document set and the sentences in the sentence set;
selecting, according to the determined importance weights and in descending order of importance weight, a specified number of sentences to form the document set abstract.
A document set abstract acquisition device comprises:
a sentence set extraction unit, used to extract each sentence contained in each document of the document set to form a sentence set;
a sentence importance weight determining unit, used to determine the importance weight of each sentence in the sentence set based on the text similarity between the documents in the document set and the sentences in the sentence set;
an abstract determining unit, used to select, according to the importance weights determined by the sentence importance weight determining unit and in descending order of importance weight, a specified number of sentences to form the document set abstract.
The multi-document summarization method proposed by the embodiments of the invention exploits the relations between the sentences and the documents in the document set and takes into account the influence of the differing importance of different documents on sentence importance weights. It can therefore determine the importance weights of the sentences in the document set more accurately, and by selecting the sentences with high importance weights to form the abstract, it obtains a better document set abstract.
Description of drawings
Fig. 1 is a flow chart of the main implementation principle of an embodiment of the invention;
Fig. 2 is a schematic diagram of the document set bipartite graph in an embodiment of the invention;
Fig. 3 is a schematic structural diagram of the document set abstract acquisition device provided by an embodiment of the invention;
Fig. 4 is a schematic structural diagram of the sentence importance weight determining unit in an embodiment of the invention;
Fig. 5 is a schematic structural diagram of the sentence importance weight determining subunit in an embodiment of the invention;
Fig. 6 is a schematic structural diagram of the abstract determining unit in an embodiment of the invention;
Fig. 7 is a schematic structural diagram of the importance weight adjusting subunit in an embodiment of the invention.
Embodiment
Because existing graph-model-based methods for acquiring document set abstracts cannot reflect the influence of the importance of the document containing a sentence on that sentence's importance weight, the resulting abstracts are poor. By building a bipartite graph model that includes the relation information between sentences and documents when the graph model is set up, the embodiments of the invention solve the above problem and provide a better scheme for acquiring document set abstracts.
The main implementation principle of the technical scheme of the embodiments of the invention, specific embodiments, and the beneficial effects that can be achieved are explained in detail below with reference to the accompanying drawings.
As shown in Fig. 1, the main implementation flow of an embodiment of the invention is as follows:
Step 10: build a document set bipartite graph model that includes the relation information between the sentences and the documents;
Step 20: determine the importance weight of each sentence in the sentence set of the bipartite graph model built in step 10;
Step 30: select the sentences with high importance weights to form the document set abstract.
In step 30, the importance weight of each sentence obtained in step 20 is adjusted according to the similarity values between the sentences in the sentence set of the bipartite graph model built in step 10: among sentences with similar content, only one sentence keeps its importance weight unchanged while the importance weights of the other sentences are reduced, which keeps the redundancy between the sentences forming the document set abstract low.
An embodiment is described in detail below to explain and illustrate the main implementation principle of the method according to the above inventive principle.
In the first step, the bipartite graph model of the document set, which contains the relation information between sentences and documents, is built as follows:
Let D = {d_j | 1 ≤ j ≤ m} denote the document set, where d_j is the j-th document in the document set and m is a natural number, the number of documents in the document set.
Each document in the document set is split into sentences, giving the sentence set S = {s_i | 1 ≤ i ≤ n} of all documents in the document set, where s_i is the i-th sentence in the sentence set and n is a natural number, the number of sentences in the sentence set.
The sentence set and the document set are taken as the two vertex sets of the bipartite graph model (see Fig. 2). Between every pair of vertices representing a sentence and a document an edge is added, giving the edge set E_SD = {e_ij | s_i ∈ S, d_j ∈ D}, where e_ij is the edge connecting the vertex of the i-th sentence and the vertex of the j-th document. Each edge e_ij carries a similarity value w_ij describing the degree of text similarity between sentence s_i and document d_j, which can usually be determined with the cosine formula commonly used in text information processing. The adjacency matrix describing the relations between the sentence vertices and the document vertices of the bipartite graph model is L = (w_ij)_{n×m}.
The bipartite graph model obtained by the above processing can be written G = <S, D, E_SD>.
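As a non-authoritative sketch of the construction just described: the patent only specifies that w_ij can be computed with the commonly used cosine formula, so the bag-of-words tokenization below (lowercased whitespace splitting, raw term frequencies) and the function names are illustrative assumptions, not the patented implementation.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def build_bipartite_matrix(documents):
    """documents: list of documents, each given as a list of sentence strings.
    Returns (sentences, L) where L[i][j] = w_ij = cosine(s_i, d_j)."""
    sentences = [s for doc in documents for s in doc]
    sent_vecs = [Counter(s.lower().split()) for s in sentences]
    doc_vecs = [Counter(w for s in doc for w in s.lower().split())
                for doc in documents]
    L = [[cosine(sv, dv) for dv in doc_vecs] for sv in sent_vecs]
    return sentences, L
```

A TF-IDF weighting or a proper sentence splitter could be substituted without changing the structure of the adjacency matrix L.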
In the second step, the importance weight of each sentence in the sentence set is determined from the bipartite graph model obtained in the first step, as follows:
Assume that the initial importance weights of all sentences in the sentence set are identical, and that the initial importance weights of all documents in the document set are likewise identical. In this embodiment, for example, the initial importance weight of each sentence in the sentence set is 1, i.e. AuthScore^(0)(s_i) = 1, and the initial importance weight of each document in the document set is 1, i.e. HubScore^(0)(d_j) = 1, where the superscript denotes the round of the iteration.
According to the following iterative formulas, the importance weight of each sentence in the sentence set and of each document in the document set is determined after every round of the iteration, until these weights equal the weights obtained after the previous round, i.e. until AuthScore^(t+1)(s_i) = AuthScore^(t)(s_i) and HubScore^(t+1)(d_j) = HubScore^(t)(d_j):

AuthScore^(t+1)(s_i) = Σ_{d_j ∈ D} w_ij × HubScore^(t)(d_j),
HubScore^(t+1)(d_j) = Σ_{s_i ∈ S} w_ij × AuthScore^(t)(s_i);

where t is a natural number, AuthScore^(t+1)(s_i) and HubScore^(t+1)(d_j) denote the importance weights of sentence s_i and document d_j after round t+1 of the iteration, and AuthScore^(t)(s_i) and HubScore^(t)(d_j) denote the importance weights of sentence s_i and document d_j after the previous round, i.e. round t.
In matrix form, the above iterative formulas are:

A^(t+1) = L H^(t)
H^(t+1) = L^T A^(t)

where A = [AuthScore(s_i)]_{n×1} and H = [HubScore(d_j)]_{m×1} denote the sentence importance weight vector and the document importance weight vector, respectively.
The sentence importance weight vector and the document importance weight vector obtained in each round of the iteration are normalized so that the importance weights of all sentences in the sentence set sum to 1 and the importance weights of all documents in the document set sum to 1, i.e.

A^(t+1) = A^(t+1) / ‖A^(t+1)‖₁
H^(t+1) = H^(t+1) / ‖H^(t+1)‖₁

where ‖A^(t+1)‖₁ and ‖H^(t+1)‖₁ denote the sums of the importance weights of all elements of the vectors A^(t+1) and H^(t+1), respectively.
The basic idea of the above iteration is to regard the relation between the sentences in the sentence set and the documents in the document set as analogous to the authority-hub relation between web pages in networked information retrieval, and to solve it with the HITS iterative algorithm, which rests on the following two assumptions:
a. an important document is usually associated with more important sentences;
b. an important sentence is usually associated with more important documents.
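Under the same caveat, the HITS-style iteration of the second step could be sketched as follows. The patent specifies the update, normalization, and convergence rules but not the implementation; the `tol` tolerance and `max_iter` cap are assumed safeguards, since the patent iterates until the weights are exactly equal.

```python
def hits_scores(L, tol=1e-8, max_iter=1000):
    """L: n x m similarity matrix (w_ij between sentence i and document j).
    Returns (auth, hub): L1-normalized sentence and document weights."""
    n, m = len(L), len(L[0])
    auth, hub = [1.0] * n, [1.0] * m          # AuthScore^(0), HubScore^(0)
    for _ in range(max_iter):
        # A^(t+1) = L H^(t);  H^(t+1) = L^T A^(t)
        new_auth = [sum(L[i][j] * hub[j] for j in range(m)) for i in range(n)]
        new_hub = [sum(L[i][j] * auth[i] for i in range(n)) for j in range(m)]
        # normalize so each weight vector sums to 1
        sa, sh = sum(new_auth), sum(new_hub)
        new_auth = [a / sa for a in new_auth]
        new_hub = [h / sh for h in new_hub]
        converged = (max(abs(a - b) for a, b in zip(new_auth, auth)) < tol
                     and max(abs(a - b) for a, b in zip(new_hub, hub)) < tol)
        auth, hub = new_auth, new_hub
        if converged:
            break
    return auth, hub
```

Note that, as in the patent, each round updates the new authority scores from the previous hub scores and the new hub scores from the previous authority scores.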
In the third step, the importance weights of all sentences in the sentence set obtained in the second step are adjusted according to the text similarity values between the sentences, and sentences with high importance weight and low textual redundancy are chosen to form the document set abstract. This can be implemented in many ways; in this embodiment the specific process is as follows:
(1) Obtain the sentence relation matrix M = (M_ij)_{n×n}, where M_ij denotes the text similarity value between any two sentences s_i and s_j of the sentence set S; similarly to the similarity value w_ij between sentence s_i and document d_j in the first step, it can be determined with the cosine formula. M is then normalized so that each row sums to 1, i.e. so that the similarity values between any sentence s_i and the other sentences of the sentence set sum to 1, giving the normalized matrix M̃.
(2) Initialize two sets A = φ (the empty set) and B = {s_i | i = 1, 2, …, n}. The initial value of the final importance weight RankScore(s_i) of each sentence is the importance weight AuthScore(s_i) obtained in the second step, i.e. RankScore(s_i) = AuthScore(s_i).
(3) Sort the elements of set B in descending order of final importance weight.
(4) Let s_i be the sentence ranked first in the sequence obtained in step (3). Move s_i from set B to set A, and apply the following redundancy penalty to each remaining sentence s_j (j ≠ i) in set B:

RankScore(s_j) = RankScore(s_j) − ω × M̃_ji × AuthScore(s_i),

where ω > 0 is the penalty degree factor; the larger ω is, the stronger the redundancy penalty. In this embodiment, ω = 10. M̃ is the normalized sentence relation matrix obtained in (1).
(5) Repeat steps (3) and (4) until B = φ.
(6) Select from set A the specified number of sentences with the largest importance weights to form the abstract.
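Steps (1) to (6) above admit a straightforward greedy implementation. The sketch below is illustrative rather than the patented implementation: the names `select_summary`, `auth`, and `M` are assumptions, and `omega` defaults to 10 as in the embodiment.

```python
def select_summary(auth, M, k, omega=10.0):
    """Greedy redundancy-penalized selection.
    auth: AuthScore of each sentence (from the HITS step);
    M: n x n sentence-sentence similarity matrix;
    k: number of sentences to pick; omega: penalty degree factor (> 0)."""
    n = len(auth)
    # (1) row-normalize M so each row sums to 1, giving M-tilde
    Mn = []
    for row in M:
        s = sum(row)
        Mn.append([v / s if s else 0.0 for v in row])
    # (2) RankScore starts at AuthScore; A holds selected, B remaining
    rank = list(auth)
    A, B = [], set(range(n))
    while B:
        # (3)-(4) pick the remaining sentence with the highest RankScore
        i = max(B, key=lambda idx: rank[idx])
        B.discard(i)
        A.append(i)
        # penalize the remaining sentences for redundancy with sentence i
        for j in B:
            rank[j] -= omega * Mn[j][i] * auth[i]
    # (5) loop until B is empty; (6) the first k picks form the summary
    return A[:k]
```

Because the sentences are appended to A in decreasing order of their adjusted scores, taking the first k elements of A is equivalent to step (6).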
The multi-document summarization method proposed by the embodiments of the invention exploits the relations between the sentences and the documents in the document set and takes into account the influence of the differing importance of different documents on sentence importance weights. Compared with prior-art schemes that consider only the relations between sentences when determining sentence importance weights, it can therefore determine the importance weights of the sentences in the document set more accurately; by selecting the sentences with high importance weights to form the abstract, it obtains a better document set abstract.
To verify the effectiveness of the method proposed by the embodiments of the invention, the following tests were carried out with the evaluation data and tasks of the Document Understanding Conference (DUC). The DUC2001 data, containing 30 document sets, and the DUC2002 data, containing 59 document sets, were selected; each abstract acquisition method was required to produce a document set abstract of at most 100 words, and the obtained abstracts were compared with manually written abstracts to evaluate the methods. The effectiveness of an abstract acquisition method is usually measured with the ROUGE evaluation system, which includes the three indicators ROUGE-1, ROUGE-2 and ROUGE-W; the larger their values, the better the method. The evaluation results of the proposed method and of the existing graph-model method based only on sentence relations are shown in Table 1 and Table 2.
Table 1: Summarization results on the DUC2001 evaluation data

System            ROUGE-1   ROUGE-2   ROUGE-W
Proposed method   0.37744   0.06966   0.11252
Existing method   0.35474   0.05733   0.10667

Table 2: Summarization results on the DUC2002 evaluation data

System            ROUGE-1   ROUGE-2   ROUGE-W
Proposed method   0.38569   0.08519   0.12500
Existing method   0.37510   0.07973   0.12198
Correspondingly, an embodiment of the invention also provides a document set abstract acquisition device. Referring to Fig. 3, the device comprises a sentence set extraction unit 310, a sentence importance weight determining unit 320 and an abstract determining unit 330, wherein:
the sentence set extraction unit 310 is used to extract each sentence contained in each document of the document set to form the sentence set; in a specific implementation, the documents in the document set can be split into sentences and each sentence contained in each document extracted;
the sentence importance weight determining unit 320 is used to determine the importance weight of each sentence in the sentence set based on the text similarity between the documents in the document set and the sentences in the sentence set;
the abstract determining unit 330 is used to select, according to the importance weights determined by the sentence importance weight determining unit 320 and in descending order of importance weight, a specified number of sentences to form the document set abstract.
Referring to Fig. 4, the sentence importance weight determining unit comprises a text similarity determining subunit 410 and a sentence importance weight determining subunit 420, wherein:
the text similarity determining subunit 410 is used to determine the text similarity between each document in the document set and each sentence in the sentence set; in a specific implementation, the cosine formula is used to determine these text similarities;
the sentence importance weight determining subunit 420 is used to determine, by iterative computation according to the text similarities determined by the text similarity determining subunit 410, the importance weight of each sentence in the sentence set.
Referring to Fig. 5, the sentence importance weight determining subunit comprises an iteration subunit 510, an iteration termination judging subunit 520 and a sentence importance weight determining subunit 530, wherein:
the iteration subunit 510 is used to determine the sentence importance weights obtained in each iteration according to the following formulas:

AuthScore^(t+1)(s_i) = Σ_{d_j ∈ D} w_ij × HubScore^(t)(d_j),
HubScore^(t+1)(d_j) = Σ_{s_i ∈ S} w_ij × AuthScore^(t)(s_i);

where t is a natural number, t+1 denotes the current iteration and t the previous iteration;
AuthScore^(t+1)(s_i) denotes the importance weight of the i-th sentence s_i of the sentence set in the current iteration;
HubScore^(t+1)(d_j) denotes the importance weight of the j-th document d_j of the document set in the current iteration;
AuthScore^(t)(s_i) denotes the importance weight of the i-th sentence s_i of the sentence set in the previous iteration;
HubScore^(t)(d_j) denotes the importance weight of the j-th document d_j of the document set in the previous iteration;
w_ij denotes the degree of text similarity between the i-th sentence s_i of the sentence set and the j-th document d_j of the document set;
the iteration termination judging subunit 520 is used to stop the iteration processing of the iteration subunit 510 when, after the latest iteration, the importance weight of each sentence in the sentence set and of each document in the document set equals the corresponding importance weight after the previous iteration;
the sentence importance weight determining subunit 530 is used, when the iteration termination judging subunit 520 stops the iteration processing of the iteration subunit 510, to take the importance weight of each sentence in the sentence set obtained after the last iteration of the iteration subunit 510 as the sought importance weight of each sentence in the sentence set.
Referring to Fig. 6, the abstract determining unit comprises an importance weight adjusting subunit 610 and a document set abstract obtaining subunit 620, wherein:
the importance weight adjusting subunit 610 is used to adjust the importance weight of each sentence according to the text similarity values between the sentences;
the document set abstract obtaining subunit 620 is used to select, in descending order of the importance weights adjusted by the importance weight adjusting subunit 610, a specified number of sentences to form the document set abstract.
Referring to Fig. 7, the importance weight adjusting subunit comprises a sorting module 710, a sentence repeated-selection module 720 and an importance weight determining module 730, wherein:
the sorting module 710 is used to sort the sentences of the sentence set in descending order of importance weight to obtain the sentence sequence;
the sentence repeated-selection module 720 is used to repeat the following processing on the sentence sequence obtained by the sorting module 710 until all sentences of the sentence sequence have been selected:
select the sentence with the highest importance weight, and for each of the remaining sentences in the sequence, adjust its importance weight to the difference between its importance weight and a penalty value, the penalty value being the product of three factors: a penalty factor, the text similarity value between this sentence and the selected sentence, and the importance weight of the selected sentence, where the penalty factor is greater than 0;
the importance weight determining module 730 is used to take the importance weights of all sentences selected by the sentence repeated-selection module 720 as the adjusted importance weights of all sentences.
Obviously, those skilled in the art can make various changes and modifications to the invention without departing from its spirit and scope. If these modifications and variations fall within the scope of the claims of the invention and their technical equivalents, the invention is also intended to cover them.

Claims (11)

1. A method for acquiring document set abstracts, characterized in that it comprises:
extracting each sentence contained in each document of the document set to form a sentence set;
determining the importance weight of each sentence in the sentence set based on the text similarity between the documents in the document set and the sentences in the sentence set;
selecting, according to the determined importance weights and in descending order of importance weight, a specified number of sentences to form the document set abstract.
2. the method for claim 1 is characterized in that, based on the text similarity between the sentence in each document in the document sets and the sentence set, determines the weights of importance value of each sentence in the sentence set, specifically comprises:
Determine the text similarity between the sentence in the set of document in the document sets and sentence; And
According to the text similarity between each sentence in document in the document sets and the sentence set,, determine the weights of importance value of each sentence in the sentence set by the interative computation mode.
3. The method as claimed in claim 2, characterized in that the importance weight value of each sentence in the sentence set is determined by iterative computation as follows:
AuthScore^(t+1)(s_i) = Σ_{d_j ∈ D} w_ij × HubScore^(t)(d_j),
HubScore^(t+1)(d_j) = Σ_{s_i ∈ S} w_ij × AuthScore^(t)(s_i);
wherein t is a natural number, t+1 denotes the current iteration, and t denotes the previous iteration;
AuthScore^(t+1)(s_i) denotes the importance weight value of the i-th sentence s_i in the sentence set S at the current iteration;
HubScore^(t+1)(d_j) denotes the importance weight value of the j-th document d_j in the document set D at the current iteration;
AuthScore^(t)(s_i) denotes the importance weight value of the i-th sentence s_i in the sentence set at the previous iteration;
HubScore^(t)(d_j) denotes the importance weight value of the j-th document d_j in the document set at the previous iteration;
w_ij denotes the text similarity between the i-th sentence s_i in the sentence set and the j-th document d_j in the document set;
The above iterative computation is repeated until, after the latest iteration, the importance weight value of each sentence in the sentence set and of each document in the document set equal the corresponding values obtained after the previous iteration;
After the iteration stops, the importance weight value of each sentence obtained in the last iteration is taken as the importance weight value of that sentence in the sentence set.
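The update rule of claim 3 is a HITS-style mutual reinforcement between sentences (authorities) and documents (hubs). The following Python sketch adds L2 normalization after each round to keep the scores bounded, and replaces exact equality with a small tolerance as the stopping test; both are practical assumptions, since the claim states only the two update formulas and the equality condition:

```python
import math

def iterate_scores(w, tol=1e-9, max_iter=10000):
    """Mutually reinforce sentence scores (AuthScore) and document scores (HubScore).

    w[i][j] is the text similarity between sentence i and document j.
    Per-round L2 normalization is an assumption made here for convergence;
    it is not part of the claim.
    """
    n_sent, n_doc = len(w), len(w[0])
    auth = [1.0] * n_sent   # AuthScore^(0)
    hub = [1.0] * n_doc     # HubScore^(0)
    for _ in range(max_iter):
        # AuthScore^(t+1)(s_i) = sum_j w_ij * HubScore^(t)(d_j)
        new_auth = [sum(w[i][j] * hub[j] for j in range(n_doc)) for i in range(n_sent)]
        # HubScore^(t+1)(d_j) = sum_i w_ij * AuthScore^(t)(s_i)
        new_hub = [sum(w[i][j] * auth[i] for i in range(n_sent)) for j in range(n_doc)]
        na = math.sqrt(sum(x * x for x in new_auth)) or 1.0
        nh = math.sqrt(sum(x * x for x in new_hub)) or 1.0
        new_auth = [x / na for x in new_auth]
        new_hub = [x / nh for x in new_hub]
        # Stop when scores no longer change between consecutive iterations.
        done = (max(abs(a - b) for a, b in zip(new_auth, auth)) < tol
                and max(abs(a - b) for a, b in zip(new_hub, hub)) < tol)
        auth, hub = new_auth, new_hub
        if done:
            break
    return auth, hub
```

With normalization, the iteration converges toward the principal singular vectors of the similarity matrix, so sentences similar to many important documents end up with the highest AuthScore.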
4. The method as claimed in claim 1, characterized in that selecting a specified number of sentences to form the document set abstract, in order from the highest determined importance weight value to the lowest, specifically comprises:
Adjusting the importance weight value of each sentence according to the text similarity values between the sentences;
Selecting a specified number of sentences in order from the highest adjusted importance weight value to the lowest, to form the document set abstract.
5. The method as claimed in claim 4, characterized in that adjusting the importance weight value of each sentence according to the text similarity values between the sentences specifically comprises:
Sorting the sentences in the sentence set in order from the highest importance weight value to the lowest, to obtain a sentence sequence;
Repeating the following processing on the sentence sequence until all sentences in the sequence have been selected:
Selecting the sentence with the highest importance weight value; for each remaining sentence in the sequence, adjusting its importance weight value to the difference between that value and a penalty value, the penalty value being the product of three terms: a penalty factor, the text similarity value between the remaining sentence and the selected sentence, and the importance weight value of the remaining sentence, wherein the penalty factor is greater than 0; and taking the importance weight values with which the sentences were selected as the adjusted importance weight values of all sentences.
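The adjustment in claims 4 and 5 is a greedy selection with a redundancy penalty: once a sentence is picked, similar sentences are pushed down the ranking. A sketch assuming the scores and sentence-to-sentence similarity matrix come from the earlier steps; the function and parameter names are illustrative, with penalty_factor=10 following claim 6:

```python
def select_summary(sentences, scores, sent_sim, num_sentences, penalty_factor=10.0):
    """Greedily pick sentences, penalizing those similar to already-picked ones.

    After each pick, every remaining sentence's weight is lowered by
    penalty_factor * sim(remaining, picked) * weight(remaining),
    i.e. the three-way product named in claim 5.
    """
    scores = list(scores)               # work on a copy
    remaining = list(range(len(sentences)))
    chosen = []
    while remaining and len(chosen) < num_sentences:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        chosen.append(best)
        for i in remaining:
            scores[i] -= penalty_factor * sent_sim[best][i] * scores[i]
    return [sentences[i] for i in chosen]
```

With a large penalty factor, a sentence nearly identical to one already selected gets a negative weight and is effectively excluded, so the abstract favors diverse content.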
6. The method as claimed in claim 5, characterized in that the penalty factor is 10.
7. A document set abstract acquiring device, characterized by comprising:
A sentence set extraction unit, configured to extract the sentences contained in each document of the document set to form a sentence set;
A sentence importance weight value determining unit, configured to determine the importance weight value of each sentence in the sentence set based on the text similarity between the documents in the document set and the sentences in the sentence set;
An abstract determining unit, configured to select, according to the importance weight values determined by the sentence importance weight value determining unit and in order from the highest importance weight value to the lowest, a specified number of sentences to form the document set abstract.
8. The device as claimed in claim 7, characterized in that the sentence importance weight value determining unit specifically comprises:
A text similarity determining subunit, configured to determine the text similarity between each document in the document set and each sentence in the sentence set;
A sentence importance weight value determining subunit, configured to determine the importance weight value of each sentence in the sentence set by iterative computation, according to the text similarity determined by the text similarity determining subunit.
9. The device as claimed in claim 8, characterized in that the sentence importance weight value determining subunit specifically comprises:
An iterative computation subunit, configured to determine the sentence importance weight values obtained at each iteration as follows:
AuthScore^(t+1)(s_i) = Σ_{d_j ∈ D} w_ij × HubScore^(t)(d_j),
HubScore^(t+1)(d_j) = Σ_{s_i ∈ S} w_ij × AuthScore^(t)(s_i);
wherein t is a natural number, t+1 denotes the current iteration, and t denotes the previous iteration;
AuthScore^(t+1)(s_i) denotes the importance weight value of the i-th sentence s_i in the sentence set at the current iteration;
HubScore^(t+1)(d_j) denotes the importance weight value of the j-th document d_j in the document set at the current iteration;
AuthScore^(t)(s_i) denotes the importance weight value of the i-th sentence s_i in the sentence set at the previous iteration;
HubScore^(t)(d_j) denotes the importance weight value of the j-th document d_j in the document set at the previous iteration;
w_ij denotes the text similarity between the i-th sentence s_i in the sentence set and the j-th document d_j in the document set;
An iteration termination judging subunit, configured to stop the iterative processing of the iterative computation subunit when, after the latest iteration, the importance weight value of each sentence in the sentence set and of each document in the document set equal the corresponding values obtained after the previous iteration;
A sentence importance weight value determining subunit, configured to take, when the iteration termination judging subunit stops the iterative processing of the iterative computation subunit, the importance weight value of each sentence in the sentence set obtained in the last iteration as the importance weight value of that sentence.
10. The device as claimed in claim 7, characterized in that the abstract determining unit specifically comprises:
An importance weight value adjusting subunit, configured to adjust the importance weight value of each sentence according to the text similarity values between the sentences;
A document set abstract obtaining subunit, configured to select a specified number of sentences, in order from the highest importance weight value adjusted by the importance weight value adjusting subunit to the lowest, to form the document set abstract.
11. The device as claimed in claim 10, characterized in that the importance weight value adjusting subunit specifically comprises:
A sorting module, configured to sort the sentences in the sentence set in order from the highest importance weight value to the lowest, to obtain a sentence sequence;
A sentence repeated selection module, configured to repeat the following processing on the sentence sequence obtained by the sorting module until all sentences in the sequence have been selected:
Selecting the sentence with the highest importance weight value; for each remaining sentence in the sequence, adjusting its importance weight value to the difference between that value and a penalty value, the penalty value being the product of three terms: a penalty factor, the text similarity value between the remaining sentence and the selected sentence, and the importance weight value of the remaining sentence, wherein the penalty factor is greater than 0;
An importance weight value determining module, configured to take the importance weight values with which the sentence repeated selection module selected the sentences as the adjusted importance weight values of all sentences.
CN200810239344A 2008-12-10 2008-12-10 Method for acquiring document set abstracts and device Pending CN101751425A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810239344A CN101751425A (en) 2008-12-10 2008-12-10 Method for acquiring document set abstracts and device


Publications (1)

Publication Number Publication Date
CN101751425A (en) 2010-06-23

Family

ID=42478416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810239344A Pending CN101751425A (en) 2008-12-10 2008-12-10 Method for acquiring document set abstracts and device

Country Status (1)

Country Link
CN (1) CN101751425A (en)


Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102111505B (en) * 2011-03-04 2013-06-05 中山大学 Short message prompting display method for mobile terminal
CN102111505A (en) * 2011-03-04 2011-06-29 中山大学 Short message prompting display method for mobile terminal
CN105706079A (en) * 2013-10-31 2016-06-22 隆沙有限公司 Topic-wise collaboration integration
US10296582B2 (en) 2014-02-22 2019-05-21 Tencent Technology (Shenzhen) Company Limited Method and apparatus for determining morpheme importance analysis model
WO2015124096A1 (en) * 2014-02-22 2015-08-27 Tencent Technology (Shenzhen) Company Limited Method and apparatus for determining morpheme importance analysis model
CN108009135A (en) * 2016-10-31 2018-05-08 深圳市北科瑞声科技股份有限公司 The method and apparatus for generating documentation summary
CN108009135B (en) * 2016-10-31 2021-05-04 深圳市北科瑞声科技股份有限公司 Method and device for generating document abstract
WO2018233647A1 (en) * 2017-06-22 2018-12-27 腾讯科技(深圳)有限公司 Abstract generation method, device and computer device and storage medium
US11409960B2 (en) 2017-06-22 2022-08-09 Tencent Technology (Shenzhen) Company Limited Summary generation method, apparatus, computer device, and storage medium
CN108763206A (en) * 2018-05-22 2018-11-06 南京邮电大学 Method for quickly ranking keywords in a single text
CN108763206B (en) * 2018-05-22 2022-04-05 南京邮电大学 Method for quickly sequencing keywords of single text
CN109325109A (en) * 2018-08-27 2019-02-12 中国人民解放军国防科技大学 Attention encoder-based extraction type news abstract generating device
CN109325109B (en) * 2018-08-27 2021-11-19 中国人民解放军国防科技大学 Attention encoder-based extraction type news abstract generating device
CN110781227A (en) * 2019-10-30 2020-02-11 中国联合网络通信集团有限公司 Information processing method and device
CN111125301A (en) * 2019-11-22 2020-05-08 泰康保险集团股份有限公司 Text method and device, electronic equipment and computer readable storage medium
CN112328783A (en) * 2020-11-24 2021-02-05 腾讯科技(深圳)有限公司 Abstract determining method and related device

Similar Documents

Publication Publication Date Title
CN101751425A (en) Method for acquiring document set abstracts and device
CN107122413B (en) Keyword extraction method and device based on graph model
CN100517330C (en) Word sense based local file searching method
CN101315624B (en) A kind of method and apparatus of text subject recommending
CN103207899B (en) Text recommends method and system
CN105243152B (en) A kind of automaticabstracting based on graph model
CN103514183B (en) Information search method and system based on interactive document clustering
CN101398814B (en) Method and system for simultaneously abstracting document summarization and key words
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN101625680B (en) Document retrieval method in patent field
CN101430695B (en) System and method for computing difference affinities of word
CN101446940B (en) Method and device of automatically generating a summary for document set
CN101944099B (en) Method for automatically classifying text documents by utilizing body
CN104598532A (en) Information processing method and device
CN102063469B (en) Method and device for acquiring relevant keyword message and computer equipment
CN103049470B (en) Viewpoint searching method based on emotion degree of association
CN105808526A (en) Commodity short text core word extracting method and device
CN103914478A (en) Webpage training method and system and webpage prediction method and system
CN103473317A (en) Method and equipment for extracting keywords
CN104866572A (en) Method for clustering network-based short texts
CN104077407B (en) A kind of intelligent data search system and method
CN107943824A (en) A kind of big data news category method, system and device based on LDA
CN101231634A (en) Autoabstract method for multi-document
CN102945244A (en) Chinese web page repeated document detection and filtration method based on full stop characteristic word string
CN103377239A (en) Method and device for calculating inter-textual similarity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C12 Rejection of a patent application after its publication
RJ01 Rejection of invention patent application after publication

Open date: 20100623