CN103885935A - Book section abstract generating method based on book reading behaviors - Google Patents

Book section abstract generating method based on book reading behaviors

Info

Publication number
CN103885935A
CN103885935A
Authority
CN
China
Prior art keywords
sentence
page
books
user
book
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410090143.6A
Other languages
Chinese (zh)
Other versions
CN103885935B (en)
Inventor
鲁伟明
安文佳
吴江琴
庄越挺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201410090143.6A priority Critical patent/CN103885935B/en
Publication of CN103885935A publication Critical patent/CN103885935A/en
Application granted granted Critical
Publication of CN103885935B publication Critical patent/CN103885935B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for generating book chapter abstracts based on book reading behaviors. The technique is essentially a document summarization method into which users' reading behaviors are incorporated, applied to engineering, science, and education book resources. The method first computes the weight of each page in a book chapter using a quantified reading-behavior page scoring mechanism, then splits the chapter into sentences, computes inter-sentence similarity from distances, propagates the existing sentence weights over the data manifold structure, and finally, based on the idea of data reconstruction, selects the sentences that best represent the chapter's content as the chapter abstract. Users' reading behaviors are collected and used to evaluate page importance, and the corresponding chapter abstract is obtained through data reconstruction, helping users grasp chapter content quickly and improving reading efficiency.

Description

Method for generating book chapter abstracts based on book reading behavior
Technical field
The present invention relates to document summarization methods, and in particular to a method for generating book chapter abstracts based on book reading behavior.
Background technology
With the continued growth of digital libraries, users hope to understand the content of book chapters quickly and accurately before reading, so a chapter-abstract service is urgently needed in digital libraries.
Book chapter abstract generation is essentially a document summarization method based on reading behavior: user reading behavior is modeled, and according to the behavior model the user's reading factors are incorporated into the summarization algorithm, so that the resulting abstract is influenced by how users actually read. If a traditional summarization method were adopted directly, the chapter abstract might not express the chapter content accurately from the users' reading perspective, and thus could not meet their needs.
In traditional reading, the object the reader reads is simply a fixed sequence of linguistic symbols. From the start of reading to its end, the reader gains understanding only from the textual content, in isolation from any social encouragement. With the emergence of social networked reading, the whole process, from selecting reading content to finishing it, becomes partly or wholly linked to a social network. Within such interconnected communities, readers' reading behaviors themselves become objects that deserve attention and study.
Social reading is a new reading mode that takes content as its core and social relations as its ties, emphasizing sharing, interaction, and dissemination. While reading, users can interact with users of similar interests; after reading, they can associate with the wider group that has read the same content, even forming communities around shared topics. Sharing, interaction, and dissemination run through the whole process of social reading, and these interactions generate a large amount of new, valuable content, such as comments, summaries, notes, and association or cross-reference information.
The basic algorithm adopted for book chapter abstract generation is document summarization based on data reconstruction (DSDR). DSDR is an extractive method built on the premise that a good summary should satisfy one property: the original document should be reconstructible from the resulting summary as far as possible, that is, the summary should cover as much of the content of the whole document as it can.
On top of the data-reconstruction summarization algorithm, the various behaviors of users during social reading are taken into account, such as reading time and the sentences users underline or circle: sentences marked by users are considered more representative and are given a higher weight than unmarked ones.
Summary of the invention
The object of the present invention is to provide a method for generating book chapter abstracts based on book reading behavior, so that users can quickly understand the information in book chapters.
The technical solution adopted by the present invention to solve this problem is as follows:
The steps of the method for generating book chapter abstracts based on book reading behavior are as follows:
1) Build quantified reading-behavior page scores: the user's reading behaviors are divided by reading depth into four levels from shallow to deep, namely browsing, bookmarking, shallow reading, and deep reading, and a page score based on user reading behavior is derived from these four levels;
2) Sentence weight propagation: the quantified page scores based on user reading behavior are obtained from step 1), the chapter is split into sentences, the quantified page score assigns each sentence an initial weight, and, based on the distances between sentences, a manifold ranking algorithm on the data propagates the sentence weights;
3) Chapter abstract generation: after the sentence weights have been propagated, they are incorporated into the data-reconstruction summarization algorithm, and important sentences are selected from the chapter as its abstract.
Said step 1) comprises:
2.1 The user's behavior of reading a given page is divided into four levels, namely browsing, bookmarking, shallow reading, and deep reading; different levels contribute differently to the page score;
2.2 Retention rate, churn rate, and exponential score decay are used to measure the difficulty of reaching each level, and the score is derived accordingly. The page user retention rate is, for a given page, the ratio of the number of users retained at bookmarking, shallow reading, or deep reading to the number of users who browsed the page; the page churn rate refers, relative to the users retained at the previous step, to the change in users at this step;
An evaluation formula based on user reading behavior is established:

V_i = [(p_i + q_i) / p_i] · exp(1 − p_i),  i = 1, 2, 3, 4

Page user retention rate:

p_i = U_i / U_1,  i = 1, 2, 3, 4

Page churn rate:

q_i = U_i / U_{i−1} for i = 2, 3, 4;  q_1 = 1

where V_i is the score contribution of step i of the whole user group's reading behavior to the page; p_i is the retention rate of step i relative to browsing; q_i is the churn rate of step i relative to step i−1; and U_i is the number of users proceeding to step i.
2.3 Page accesses are ordered in time: the earlier a user accesses and scores a page, the larger that user's contribution to the page. From the scores at the key behavior nodes of a page, the importance of the page can be computed; the combined page score formula is as follows:

s_j = Σ_{u∈R_j} (W_uj × S_uj) / Σ_{u∈R_j} W_uj

W_uj = log2(T_j / (t_uj − t_j)) if t_uj ≠ t_j;  W_uj = log2(T_j) if t_uj = t_j

S_uj = Σ_{i=1}^{L} V_ij

In the formulas above: s_j is the score of page j; W_uj is the contribution weight of user u to page j; T_j is the total time page j has been accessed; t_uj is the time user u first accessed page j; t_j is the time page j was first accessed; S_uj is the sum of the scores of the key behavior steps user u reached on page j; V_ij is the score of the i-th key behavior step user u reached on page j; and L is the reading depth, i.e. the number of key steps, user u reached on page j;
2.4 The scoring method above gives each page a quantified importance score within the book. Because reading populations differ, and to avoid pages receiving inflated scores when the number of visiting users is small, the number of visiting users and the score are normalized in the actual page evaluation; the final combined page score formula is:

PageScore_j = [log u_j − log ū] + [log2 s_j − log2 s̄]

where u_j is the number of users who browsed page j, s_j is the score of page j, ū and s̄ are the respective means over all pages, and PageScore_j is the page's combined score. By comparing with the means, the combined score is high only when both the number of users browsing the page and the readers' scores for it are high. Based on the characteristics of user reading behavior, a page-importance evaluation system is established: user behavior is quantified through the four reading levels of a page, the difficulty of progressing from browsing to deep reading is defined by computing the evaluation contribution of each level, and finally the importance of a page is quantified from the reading behavior of the user group on it.
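As a concrete illustration, the page-scoring formulas of step 1) can be sketched in Python. This is a hedged sketch, not part of the patent: the function names (`level_scores`, `page_raw_score`) and the input representation (per-level user counts; per-user first-access times and reached depths) are assumptions introduced here, and edge cases such as t_uj − t_j exceeding T_j are not handled.

```python
import math

def level_scores(U):
    """Per-level score contributions V_i from the user counts
    U[0..3] = users reaching browse, bookmark, shallow read, deep read."""
    V = []
    for i in range(4):
        p = U[i] / U[0]                          # p_i: retention vs. browsing
        q = U[i] / U[i - 1] if i > 0 else 1.0    # q_i: step-to-step churn ratio
        V.append((p + q) / p * math.exp(1 - p))  # V_i = [(p_i+q_i)/p_i]*exp(1-p_i)
    return V

def page_raw_score(users, V, T_j, t_j):
    """Combined page score s_j. `users` is a list of (t_uj, L) pairs:
    user u's first-access time on page j and the deepest level L (1..4)
    that user reached."""
    num = den = 0.0
    for t_uj, L in users:
        # W_uj: the earlier the first access, the larger the contribution weight
        w = math.log2(T_j) if t_uj == t_j else math.log2(T_j / (t_uj - t_j))
        s = sum(V[:L])                           # S_uj: sum over reached steps
        num += w * s
        den += w
    return num / den
```

For example, with 100 browsing users of whom 50 bookmark, 20 read shallowly, and 5 read deeply, `level_scores([100, 50, 20, 5])` yields the four contributions V_1..V_4 that feed into s_j.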
Said step 2) comprises:
3.1 Step 1) gives the score PageScore_j of page j, which reflects the importance of page j within the book; at the same time, a sentence marked on that page has a relative importance within the page. The relation between sentence importance and page score is:

w_i = (L_i · PageScore_j) / Σ_{i=1}^{n} (L_i · PageScore_j) if L_i ≠ 0;  w_i = 0 if L_i = 0

where w_i is the current weight of sentence v_i. Suppose the set of document sentences is V = {v_1, ..., v_n}, where v_i is the i-th sentence in the set V; the sentences underlined by users are placed at the front of the set, and assuming the first k sentences are the ones users underlined, the weights of the remaining sentences are obtained from their relation to the first k sentences;
3.2 Let dis: V × V → R denote the distance metric on the set V, giving the distance dis(v_i, v_j) between every pair of sentences v_i and v_j. Let the mapping f denote the ranking function that assigns each sentence v_i a weight f_i; let the vector f = [f_1, ..., f_n]^T and the vector w = [w_1, ..., w_n]^T, where w_i ≠ 0 if sentence v_i was underlined and w_i = 0 otherwise; w_i is the initial weight of each sentence;
3.3 The weight propagation algorithm on the data manifold structure is as follows:
Step 1: compute the pairwise sentence-vector distances dis(v_i, v_j) and sort them in ascending order; following this order, connect an edge between the corresponding pairs of sentence nodes until a connected graph is obtained;
Step 2: define the affinity matrix W such that W_ij = exp[−dis²(v_i, v_j) / 2σ²] if there is an edge between the points for v_i and v_j, W_ij = 0 if there is none, and W_ii = 0;
Step 3: symmetrically normalize W to obtain the matrix S = D^{−1/2} W D^{−1/2}, where D is the diagonal matrix with diagonal elements D_ii = Σ_{j=1}^{n} W_ij;
Step 4: iterate f(t+1) = αSf(t) + (1 − α)w until convergence, where α is a parameter with range [0, 1);
Step 5: let f_i* denote the limit of the sequence {f_i(t)}; the limit sentence weights are {f_1*, ..., f_n*}, and the sentence weight vector is f* = [f_1*, ..., f_n*]^T;
3.4 In Step 4, the parameter α balances the weight contribution of neighboring nodes against a node's initial weight. Because the matrix S in the algorithm is symmetric, the propagation of weights is symmetric. The limit of the sequence {f(t)} can also be computed in closed form as f* = (I − αS)^{−1} w (up to a constant factor that does not affect the ranking). After this propagation, each sentence in the chapter has obtained a reasonable weight.
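The propagation of Steps 1-5 can be sketched as follows. This is a hedged illustration under simplifying assumptions: for brevity it uses a fully connected similarity graph rather than growing edges in ascending-distance order until connectivity, and it runs a fixed number of iterations rather than testing convergence; `propagate_weights`, the value of `sigma`, and the Euclidean distance are choices made for this sketch.

```python
import numpy as np

def propagate_weights(X, w, alpha=0.85, sigma=1.0, iters=200):
    """Spread initial sentence weights w over a similarity graph built
    from the sentence vectors X (one row per sentence)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)  # dis(v_i, v_j)
    W = np.exp(-d**2 / (2 * sigma**2))        # W_ij = exp[-dis^2 / 2 sigma^2]
    np.fill_diagonal(W, 0.0)                  # W_ii = 0
    Dinv = np.diag(1.0 / np.sqrt(W.sum(axis=1)))
    S = Dinv @ W @ Dinv                       # S = D^(-1/2) W D^(-1/2)
    f = w.copy()
    for _ in range(iters):                    # f(t+1) = a*S*f(t) + (1-a)*w
        f = alpha * (S @ f) + (1 - alpha) * w
    return f
```

As expected from the formulation, a sentence close to an underlined one ends up with a larger propagated weight than a distant one.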
Said step 3) comprises:
4.1 Obtain the weight f_i* of each chapter sentence v_i; this weight reflects the importance of sentence v_i within the chapter. Take the n weights f_1*, ..., f_n* as the diagonal elements of a matrix F, i.e. F_ii = f_i*, to obtain the diagonal matrix F, which is then added to the data-reconstruction summarization algorithm;
4.2 In the summarization process, the objective function of the linear non-negative data-reconstruction algorithm is redefined as:

min_{a_i, β} J = Σ_{i=1}^{n} { f_i* ||v_i − V^T a_i||² + Σ_{j=1}^{n} a_ij² / β_j } + γ ||β||_1
s.t. β_j ≥ 0, a_ij ≥ 0, and a_i ∈ R^n

Here the selection of each sentence incorporates the chapter sentence weight f_i*; the constraint a_ij ≥ 0 means the method allows only additive combinations of sentences in the ensemble space, with no subtraction; β = [β_1, β_2, ..., β_n]^T is an auxiliary variable: if β_j = 0, then a_1j, ..., a_nj are all 0, meaning the candidate sentence of column j is not selected; γ is the regularization parameter;
4.3 The objective function of the data-reconstruction summarization algorithm is a convex optimization problem, so a globally optimal solution is guaranteed. Fixing a_i and setting the derivative of J with respect to β to 0, the minimizing β is:

β_j = sqrt( Σ_{i=1}^{n} a_ij² / γ )

With this minimizing β, the minimization problem under the non-negativity constraint can be solved with the Lagrangian method;
4.4 Let α_ij be the Lagrange multiplier for the constraint a_ij ≥ 0 with A = [a_ij]; the Lagrangian L is:

L = J + Tr[αA^T] = Tr[F(V − AV)(V − AV)^T + diag(β)^{−1} A^T A] + γ||β||_1 + Tr[αA^T],  α = [α_ij]

where F is the diagonal matrix of step 4.1, with diagonal entries f_1*, ..., f_n*; diag(β) is also a diagonal matrix, with diagonal entries β_1, ..., β_n;
4.5 Differentiating the Lagrangian L with respect to A gives:

∂L/∂A = −2FVV^T + 2FAVV^T + 2A diag(β)^{−1} + α

Setting the derivative to 0, α can be expressed as:

α = 2FVV^T − 2FAVV^T − 2A diag(β)^{−1}

By the Karush-Kuhn-Tucker condition α_ij a_ij = 0, multiplying each term of the above by a_ij gives the equation:

(FVV^T)_ij a_ij − (FAVV^T)_ij a_ij − (A diag(β)^{−1})_ij a_ij = 0

which yields the following update rule:

a_ij ← a_ij (FVV^T)_ij / [FAVV^T + A diag(β)^{−1}]_ij

Iterating this update until convergence finally yields the summary sentences of the book chapter.
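The alternating β/A updates of steps 4.2-4.5 can be sketched as follows. This is an illustrative sketch, not the patent's exact implementation: `weighted_dsdr`, the uniform initialization of A, the fixed iteration count, and the assumption of non-negative term-frequency sentence vectors are all choices made for the example, and the small epsilon guards against division by zero.

```python
import numpy as np

def weighted_dsdr(V, f, gamma=0.1, iters=100, eps=1e-12):
    """Multiplicative updates for A under the weighted reconstruction
    objective; returns beta, where a large beta_j marks sentence j as a
    summary candidate. V: n x d non-negative sentence vectors;
    f: n propagated sentence weights."""
    n = V.shape[0]
    F = np.diag(f)                 # F_ii = f_i*
    A = np.full((n, n), 1.0 / n)   # non-negative starting point
    G = V @ V.T                    # Gram matrix V V^T
    for _ in range(iters):
        beta = np.sqrt((A**2).sum(axis=0) / gamma)  # beta_j = sqrt(sum_i a_ij^2 / gamma)
        denom = F @ A @ G + A @ np.diag(1.0 / np.maximum(beta, eps))
        A *= (F @ G) / np.maximum(denom, eps)       # a_ij <- a_ij (FVV^T)_ij / [...]_ij
    return np.sqrt((A**2).sum(axis=0) / gamma)
```

Sentences whose columns of A carry large coefficients (large beta_j) are the ones from which many other sentences are reconstructed, and these are selected for the abstract.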
Compared with the prior art, the method of the invention has the following beneficial effects:
1. The method combines user reading-behavior modeling with document summarization, applying the data-reconstruction summarization algorithm to book chapter abstract generation to obtain the summary information of book chapters;
2. The method analyzes and models user reading behavior; the modeling adopts the idea of reading depth, dividing reading behavior into levels and finally producing a combined page scoring system in which a higher score indicates a more important page;
3. The method takes the sentences of a book chapter as units and propagates the existing sentence weights over the data manifold space, finally obtaining a reasonable weight for each sentence, so that user behavior is reflected more accurately.
Brief description of the drawings
Fig. 1 is the system architecture diagram of the method for generating book chapter abstracts based on book reading behavior;
Fig. 2 is a block diagram of the sentence weight propagation method of the present invention;
Fig. 3 is the book catalogue view of the embodiment of the present invention;
Fig. 4 is a schematic view of the first section of the embodiment of the present invention;
Fig. 5 shows the chapter abstract generation result of the embodiment of the present invention.
Detailed description
As shown in Fig. 1 and Fig. 2, the steps of the method for generating book chapter abstracts based on book reading behavior are as follows:
1) Build quantified reading-behavior page scores: the user's reading behaviors are divided by reading depth into four levels from shallow to deep, namely browsing, bookmarking, shallow reading, and deep reading, and a page score based on user reading behavior is derived from these four levels;
2) Sentence weight propagation: the quantified page scores based on user reading behavior are obtained from step 1), the chapter is split into sentences, the quantified page score assigns each sentence an initial weight, and, based on the distances between sentences, a manifold ranking algorithm on the data propagates the sentence weights;
3) Chapter abstract generation: after the sentence weights have been propagated, they are incorporated into the data-reconstruction summarization algorithm, and important sentences are selected from the chapter as its abstract.
Said step 1) comprises:
2.1 The user's behavior of reading a given page is divided into four levels, namely browsing, bookmarking, shallow reading, and deep reading; different levels contribute differently to the page score;
2.2 Retention rate, churn rate, and exponential score decay are used to measure the difficulty of reaching each level, and the score is derived accordingly. The score decays exponentially with the retention rate: the score value at a given step is related both to the churn rate of the previous step and to the retention rate of the initial stage. First, the page user retention rate and churn rate are defined. The page user retention rate is, for a given page, the ratio of the number of users retained at bookmarking, shallow reading, or deep reading to the number of users who browsed the page; the page churn rate refers, relative to the users retained at the previous step, to the change in users at this step;
An evaluation formula based on user reading behavior is established:

V_i = [(p_i + q_i) / p_i] · exp(1 − p_i),  i = 1, 2, 3, 4

Page user retention rate:

p_i = U_i / U_1,  i = 1, 2, 3, 4

Page churn rate:

q_i = U_i / U_{i−1} for i = 2, 3, 4;  q_1 = 1

where V_i is the score contribution of step i of the whole user group's reading behavior to the page; p_i is the retention rate of step i relative to browsing; q_i is the churn rate of step i relative to step i−1; and U_i is the number of users proceeding to step i.
2.3 Page accesses are ordered in time: the earlier a user accesses and scores a page, the larger that user's contribution to the page; for example, if the first visiting user already reads a page deeply, that page's importance is relatively higher. From the scores at the key behavior nodes of a page, the importance of the page can be computed; the combined page score formula is as follows:

s_j = Σ_{u∈R_j} (W_uj × S_uj) / Σ_{u∈R_j} W_uj

W_uj = log2(T_j / (t_uj − t_j)) if t_uj ≠ t_j;  W_uj = log2(T_j) if t_uj = t_j

S_uj = Σ_{i=1}^{L} V_ij

In the formulas above: s_j is the score of page j; W_uj is the contribution weight of user u to page j; T_j is the total time page j has been accessed; t_uj is the time user u first accessed page j; t_j is the time page j was first accessed; S_uj is the sum of the scores of the key behavior steps user u reached on page j; V_ij is the score of the i-th key behavior step user u reached on page j; and L is the reading depth, i.e. the number of key steps, user u reached on page j;
2.4 The scoring method above gives each page a quantified importance score within the book. Because reading populations differ, and to avoid pages receiving inflated scores when the number of visiting users is small, the number of visiting users and the score are normalized in the actual page evaluation; the final combined page score formula is:

PageScore_j = [log u_j − log ū] + [log2 s_j − log2 s̄]

where u_j is the number of users who browsed page j, s_j is the score of page j, ū and s̄ are the respective means over all pages, and PageScore_j is the page's combined score. By comparing with the means, the combined score is high only when both the number of users browsing the page and the readers' scores for it are high. Based on the characteristics of user reading behavior, a page-importance evaluation system is established: user behavior is quantified through the four reading levels of a page, the difficulty of progressing from browsing to deep reading is defined by computing the evaluation contribution of each level, and finally the importance of a page is quantified from the reading behavior of the user group on it.
Said step 2) comprises:
3.1 Step 1) gives the score PageScore_j of page j, which reflects the importance of page j within the book; at the same time, a sentence marked on that page has a relative importance within the page. The relation between sentence importance and page score is:

w_i = (L_i · PageScore_j) / Σ_{i=1}^{n} (L_i · PageScore_j) if L_i ≠ 0;  w_i = 0 if L_i = 0

where w_i is the current weight of sentence v_i. Suppose the set of document sentences is V = {v_1, ..., v_n}, where v_i is the i-th sentence in the set V; the sentences underlined by users are placed at the front of the set, and assuming the first k sentences are the ones users underlined, the weights of the remaining sentences are obtained from their relation to the first k sentences;
3.2 Let dis: V × V → R denote the distance metric on the set V, giving the distance dis(v_i, v_j) between every pair of sentences v_i and v_j. Let the mapping f denote the ranking function that assigns each sentence v_i a weight f_i; let the vector f = [f_1, ..., f_n]^T and the vector w = [w_1, ..., w_n]^T, where w_i ≠ 0 if sentence v_i was underlined and w_i = 0 otherwise; w_i is the initial weight of each sentence;
3.3 The weight propagation algorithm on the data manifold structure is as follows:
Step 1: compute the pairwise sentence-vector distances dis(v_i, v_j) and sort them in ascending order; following this order, connect an edge between the corresponding pairs of sentence nodes until a connected graph is obtained;
Step 2: define the affinity matrix W such that W_ij = exp[−dis²(v_i, v_j) / 2σ²] if there is an edge between the points for v_i and v_j, W_ij = 0 if there is none, and W_ii = 0;
Step 3: symmetrically normalize W to obtain the matrix S = D^{−1/2} W D^{−1/2}, where D is the diagonal matrix with diagonal elements D_ii = Σ_{j=1}^{n} W_ij;
Step 4: iterate f(t+1) = αSf(t) + (1 − α)w until convergence, where α is a parameter with range [0, 1);
Step 5: let f_i* denote the limit of the sequence {f_i(t)}; the limit sentence weights are {f_1*, ..., f_n*}, and the sentence weight vector is f* = [f_1*, ..., f_n*]^T;
3.4 In Step 4, the parameter α balances the weight contribution of neighboring nodes against a node's initial weight. Because the matrix S in the algorithm is symmetric, the propagation of weights is symmetric. The limit of the sequence {f(t)} can also be computed in closed form as f* = (I − αS)^{−1} w (up to a constant factor that does not affect the ranking). After this propagation, each sentence in the chapter has obtained a reasonable weight.
Said step 3) comprises:
4.1 Obtain the weight f_i* of each chapter sentence v_i; this weight reflects the importance of sentence v_i within the chapter. Take the n weights f_1*, ..., f_n* as the diagonal elements of a matrix F, i.e. F_ii = f_i*, to obtain the diagonal matrix F, which is then added to the data-reconstruction summarization algorithm;
4.2 In the summarization process, the objective function of the linear non-negative data-reconstruction algorithm is redefined as:

min_{a_i, β} J = Σ_{i=1}^{n} { f_i* ||v_i − V^T a_i||² + Σ_{j=1}^{n} a_ij² / β_j } + γ ||β||_1
s.t. β_j ≥ 0, a_ij ≥ 0, and a_i ∈ R^n

Here the selection of each sentence incorporates the chapter sentence weight f_i*; the constraint a_ij ≥ 0 means the method allows only additive combinations of sentences in the ensemble space, with no subtraction; β = [β_1, β_2, ..., β_n]^T is an auxiliary variable: if β_j = 0, then a_1j, ..., a_nj are all 0, meaning the candidate sentence of column j is not selected; γ is the regularization parameter;
4.3 The objective function of the data-reconstruction summarization algorithm is a convex optimization problem, so a globally optimal solution is guaranteed. Fixing a_i and setting the derivative of J with respect to β to 0, the minimizing β is:

β_j = sqrt( Σ_{i=1}^{n} a_ij² / γ )

With this minimizing β, the minimization problem under the non-negativity constraint can be solved with the Lagrangian method;
4.4 Let α_ij be the Lagrange multiplier for the constraint a_ij ≥ 0 with A = [a_ij]; the Lagrangian L is:

L = J + Tr[αA^T] = Tr[F(V − AV)(V − AV)^T + diag(β)^{−1} A^T A] + γ||β||_1 + Tr[αA^T],  α = [α_ij]

where F is the diagonal matrix of step 4.1, with diagonal entries f_1*, ..., f_n*; diag(β) is also a diagonal matrix, with diagonal entries β_1, ..., β_n;
4.5 Differentiating the Lagrangian L with respect to A gives:

∂L/∂A = −2FVV^T + 2FAVV^T + 2A diag(β)^{−1} + α

Setting the derivative to 0, α can be expressed as:

α = 2FVV^T − 2FAVV^T − 2A diag(β)^{−1}

By the Karush-Kuhn-Tucker condition α_ij a_ij = 0, multiplying each term of the above by a_ij gives the equation:

(FVV^T)_ij a_ij − (FAVV^T)_ij a_ij − (A diag(β)^{−1})_ij a_ij = 0

which yields the following update rule:

a_ij ← a_ij (FVV^T)_ij / [FAVV^T + A diag(β)^{−1}]_ij

Iterating this update until convergence finally yields the summary sentences of the book chapter.
Embodiment
As shown in Figs. 3 to 5, an application example of the book chapter abstract generation method is given. The concrete implementation steps of this example are described in detail below in conjunction with the method of this technology:
(1) All book chapters are preprocessed in the system to obtain the chapter document content. Suppose a user is reading the first section, "Definition", of Chapter 1, "Introduction to Distributed Computing", of the book "Principles and Applications of Distributed Computing", and wants the chapter abstract of this section. The user clicks the catalogue button and double-clicks the corresponding chapter; the system first obtains data such as the text of the chapter and the users' reading behaviors.
(2) From the users' reading-behavior data, the types and levels of user reading in the chapter are analyzed, and the quantified importance score of each page is obtained from the combined page scoring formula.
(3) The text of the chapter is split into sentences, and, combining the users' underlining behavior with the quantified page scores, the initial weights of the underlined sentences are obtained.
(4) Each sentence is segmented into words, stop words are removed, and so on; each sentence is built into a high-dimensional vector, and the pairwise sentence similarities are obtained from the distances between vectors.
(5) The initial sentence weights are propagated with the ranking method on the data manifold space, finally yielding a reasonable weight for each sentence.
(6) The sentence weight matrix F is added to the data-reconstruction summarization algorithm, which is run until convergence; a number of sentences (depending on chapter length) are chosen from the chapter as its summary information and finally returned to the user.
The operation result of this example at accompanying drawing 3 to middle demonstration, user is just at read books, can check by catalogue the clip Text of corresponding chapters and sections, facilitate the faster more detailed chapters and sections content of understanding of user, this books chapters and sections abstraction generating method has good use value and application prospect.
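Step (4) above can be illustrated with a toy bag-of-words sketch in Python. The stop-word list and sentences are hypothetical, and real Chinese text would first need a word segmenter; the similarity uses the form exp(−dis²/(2σ²)) that the weight propagation step later relies on:

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "and"}   # illustrative stop list

def sentence_vector(sentence, vocab):
    """Bag-of-words vector over a fixed vocabulary, stop words removed."""
    counts = Counter(w for w in sentence.lower().split() if w not in STOP_WORDS)
    return [counts[w] for w in vocab]

def similarity(u, v, sigma=1.0):
    """Distance-based similarity: exp(-dis^2 / (2 sigma^2))."""
    dis2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-dis2 / (2 * sigma ** 2))

sents = ["distributed computing is a field", "a field of computer science"]
vocab = sorted({w for s in sents for w in s.lower().split()} - STOP_WORDS)
vecs = [sentence_vector(s, vocab) for s in sents]
sim = similarity(vecs[0], vecs[1])     # pairwise sentence similarity
```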

Claims (4)

1. A book section abstract generating method based on book reading behavior, characterized in that its steps are as follows:
1) Building a quantified reading-behavior score per page: the user's reading behavior is divided into four levels from shallow to deep according to reading depth, namely the browsing level, the bookmarking level, the shallow-reading level and the deep-reading level, and a page score based on user reading behavior is obtained from these four levels;
2) Sentence weight propagation: a quantified page score is obtained from the behavior-based page score of step 1); the book section is divided into sentences and the quantified page score is assigned to each sentence as an initial weight value; then, based on the distances between sentences, a ranking algorithm on the data manifold structure is used to propagate the sentence weight values;
3) Book section summary generation: after the sentence weight values have been propagated, they are added to the document summary generating algorithm based on data reconstruction, and the important sentences are selected from the book section as its summary.
2. The book section abstract generating method based on book reading behavior according to claim 1, characterized in that said step 1) is:
2.1 The user's reading behavior on a given page is divided into four levels, namely the browsing level, the bookmarking level, the shallow-reading level and the deep-reading level; different levels make different score contributions to the page;
2.2 The retention rate, the churn rate and score-index decay are used to measure the difficulty of reaching a given level, and the page is scored accordingly. The page user retention rate is, for a given page, the ratio of the number of users retained at the bookmarking, shallow-reading or deep-reading step to the number of users at the browsing step; the page churn rate is the ratio of the number of users at the current step to the number of users retained at the previous step.
The scoring formula based on user reading behavior is established as:

V_i = [(p_i + q_i) / p_i]·exp(1 − p_i),  i = 1, 2, 3, 4

Page user retention rate formula:

p_i = U_i / U_1,  i = 1, 2, 3, 4

Page churn rate formula:

q_i = U_i / U_(i−1) for i = 2, 3, 4;  q_i = 1 for i = 1

where: V_i is the score contribution to the page of step i of the whole user group's reading behavior; p_i is the retention rate of step i relative to browsing; q_i is the churn rate of step i relative to step i−1; and U_i is the number of users who proceed to step i;
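The three formulas above can be checked with a short Python sketch; the user counts are hypothetical example numbers, with U given as [U_1, U_2, U_3, U_4] for the four levels:

```python
import math

def level_contributions(U):
    """Score contribution V_i of each reading level (browse, bookmark,
    shallow read, deep read) from the user counts U = [U_1, ..., U_4].

    p_i = U_i / U_1                        retention relative to browsing
    q_i = U_i / U_{i-1} (i >= 2), q_1 = 1  step-to-step churn rate
    V_i = ((p_i + q_i) / p_i) * exp(1 - p_i)
    """
    p = [u / U[0] for u in U]
    q = [1.0] + [U[i] / U[i - 1] for i in range(1, len(U))]
    return [((p[i] + q[i]) / p[i]) * math.exp(1 - p[i]) for i in range(len(U))]

# e.g. 100 users browsed, 40 bookmarked, 20 read shallowly, 5 read deeply
V = level_contributions([100, 40, 20, 5])
# V[0] == 2.0 exactly, since p_1 = q_1 = 1; deeper (harder to reach)
# levels receive larger score contributions.
```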
2.3 Accesses to a page are ordered in time: a user who accesses and scores the page earlier contributes more to it. The importance of a page is calculated from the scoring of critical behavior nodes, and the comprehensive importance score of a page is as follows:
s_j = Σ_{u∈R_j} (W_uj × S_uj) / Σ_{u∈R_j} W_uj

W_uj = log_2(T_j / (t_uj − t_j)) if t_uj ≠ t_j;  W_uj = log_2(T_j) if t_uj = t_j

S_uj = Σ_{i=1}^{L} V_ij

In the above formulas: s_j is the score value of page j; W_uj is the contribution weight of user u to page j; R_j is the set of users who have accessed page j; T_j is the total time over which page j has been accessed; t_uj is the time at which user u first accessed page j; t_j is the time at which page j was first accessed; S_uj is the sum of the score values of the critical behavior steps reached by user u on page j; V_ij is the score value of the i-th critical behavior step reached by user u on page j; and L is the reading depth, i.e. the number of critical steps, that user u reached on page j;
2.4 With the above scoring method, a quantified importance score can be given to every page of a book. Because of the diversity of the reading population, and to avoid pages receiving high scores merely because few users accessed them, the number of accessing users and the score are normalized in the actual page evaluation, giving the final comprehensive page scoring formula as follows:
PageScore_j = [log u_j − log ū] + [log_2 s_j − log_2 s̄]

In the above formula: u_j is the number of users who browsed page j; s_j is the score of page j; ū and s̄ are the corresponding mean values over the pages; and PageScore_j is the comprehensive score of the page. By comparison with the mean values, the comprehensive score is high only when both the number of users browsing the page and the readers' score for the page are high. According to the characteristics of user reading behavior in book reading, a page importance evaluation system based on user reading behavior is thus established: user behavior is quantified through the four reading levels of a page, the evaluation contributions of the four levels are calculated to capture the difficulty of a user advancing from the browsing level to the deep-reading level, and the importance of a page is finally quantified from the reading behavior of the user group on that page.
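A small sketch of the comprehensive scoring formula in Python. It assumes (as the comparison-with-the-mean wording implies) that the bar terms are the averages of the user counts and scores over all pages being evaluated:

```python
import math

def page_scores(users, scores):
    """PageScore_j = (log u_j - log mean(u)) + (log2 s_j - log2 mean(s)).

    users  : browsing user counts u_j per page (all > 0)
    scores : behavior-based scores s_j per page (all > 0)
    """
    u_mean = sum(users) / len(users)
    s_mean = sum(scores) / len(scores)
    return [
        (math.log(u) - math.log(u_mean)) + (math.log2(s) - math.log2(s_mean))
        for u, s in zip(users, scores)
    ]

# A page only scores high when it is both widely browsed and highly rated.
```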
3. The book section abstract generating method based on book reading behavior according to claim 1, characterized in that said step 2) is:
3.1 Step 1) gives the score PageScore_j of page j, which reflects the importance of page j in the book; at the same time, an underlined sentence has a relative importance within that page. The relation between sentence importance and page score is as follows:

w_i = (L_i · PageScore_j) / Σ_{i=1}^{n} (L_i · PageScore_j) if L_i ≠ 0;  w_i = 0 if L_i = 0

In the above formula, w_i represents the current weight value of sentence v_i. Suppose the given set of document sentences is V = {v_1, v_2, …, v_n}, where v_i represents the i-th sentence in the set V. The sentences underlined by the user are placed at the front of the set; assuming the first k sentences are the ones the user underlined, the weight values of the remaining sentences are obtained through their relation to the first k sentences;
3.2 Let dis denote a distance metric on the set V, so that the distance dis(v_i, v_j) between every pair of sentences v_i and v_j can be obtained. Let the mapping f: V → R represent a ranking function that assigns a weight value f_i to each sentence v_i, and write the vectors f = [f_1, …, f_n]^T and w = [w_1, …, w_n]^T, where w_i ≠ 0 if sentence v_i was underlined and w_i = 0 otherwise; w_i represents the initial weight value of each sentence;
3.3 The weight propagation algorithm on the data manifold structure is expressed as follows:
Step 1: Compute the pairwise distances dis(v_i, v_j) between the sentence vectors and sort them in ascending order; following this order, connect an edge between the pair of nodes corresponding to each sentence pair until a connected graph is obtained;
Step 2: Define the affinity matrix W such that W_ij = exp[−dis^2(v_i, v_j) / (2σ^2)] if there is an edge between the points corresponding to v_i and v_j, W_ij = 0 if there is no edge, and W_ii = 0;
Step 3: Symmetrically normalize W to obtain the matrix S: S = D^(−1/2)·W·D^(−1/2), where D is the diagonal matrix with diagonal elements D_ii = Σ_{j=1}^{n} W_ij;
Step 4: Iterate f(t+1) = α·S·f(t) + (1 − α)·w until convergence, where α is a parameter with value range [0, 1);
Step 5: Let f_i* denote the limit of the sequence {f_i(t)}; the resulting limit of the sentence weights is the sentence weight vector f* = [f_1*, …, f_n*]^T;
3.4 In Step 4, the parameter α specifies the relative contributions of the neighboring nodes' weight values and of the initial weight value to each node's weight. Because the matrix S in the algorithm is symmetric, the propagation of weight values is symmetric; and the convergence value of the sequence {f(t)} can be computed in closed form as f* = (I − αS)^(−1)·w (up to a constant factor, which does not affect the ranking). Through this propagation of weight values, a reasonable weight value is obtained for each sentence of the book section.
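Steps 2 to 5 of the propagation can be sketched with NumPy. The illustration below takes a dense affinity matrix W as given (skipping the connected-graph construction of Step 1); the iteration converges to (1 − α)(I − αS)^(−1)·w, which is proportional to the closed form stated in the text and therefore yields the same sentence ranking:

```python
import numpy as np

def manifold_ranking(W, w, alpha=0.6, iters=500):
    """Propagate initial sentence weights w over the affinity matrix W:
        f(t+1) = alpha * S f(t) + (1 - alpha) * w,
    with S = D^{-1/2} W D^{-1/2} the symmetrically normalized affinity.
    """
    W = np.asarray(W, dtype=float)
    np.fill_diagonal(W, 0.0)                  # W_ii = 0 (Step 2)
    d = W.sum(axis=1)                         # D_ii = sum_j W_ij
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    S = D_inv_sqrt @ W @ D_inv_sqrt           # Step 3
    w = np.asarray(w, dtype=float)
    f = w.copy()
    for _ in range(iters):                    # Step 4
        f = alpha * (S @ f) + (1 - alpha) * w
    return f                                  # Step 5: converged weights
```

Sentences underlined by the user seed the vector w; after propagation, every sentence carries a weight reflecting both the underlining and its position on the sentence manifold.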
4. The book section abstract generating method based on book reading behavior according to claim 1, characterized in that said step 3) is:
4.1 The weight value f_i* of each book-section sentence v_i has been obtained; f_i* reflects the importance of sentence v_i within the section. The n weight values f_i* are taken as the diagonal elements of a matrix F, i.e. F_ii = f_i*, giving the diagonal matrix F, which is added to the document summary generating algorithm based on data reconstruction;
4.2 In the document summary generating process, the objective function of the linear nonnegative reconstruction algorithm is redefined as follows:

min_{a_i,β} J = Σ_{i=1}^{n} { f_i*·||v_i − V^T·a_i||^2 + Σ_{j=1}^{n} a_ij^2 / β_j } + γ·||β||_1
s.t. β_j ≥ 0, a_ij ≥ 0, and a_i ∈ R^n

In the above formula, the weight value f_i* of each book-section sentence v_i has been added to the sentence selection process. The constraint a_ij ≥ 0 means that the method only allows additive combinations of sentences in the ensemble space, not subtractive ones. Meanwhile β = [β_1, β_2, …, β_n]^T is an auxiliary variable; if β_j = 0, then a_1j, …, a_nj are all 0, which means the candidate sentence of column j is not selected; γ is the regularization parameter;
4.3 The objective function of the document summary generating algorithm based on data reconstruction is a convex optimization problem, so a globally optimal solution can be guaranteed. Fixing a_i and setting the derivative of J with respect to β to 0, the minimizing solution for β is as follows:

β_j = sqrt( Σ_{i=1}^{n} a_ij^2 / γ )

After the minimizing solution for β has been obtained, the minimization problem under the nonnegativity constraint can be solved with the Lagrangian method;
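As a consistency check (an added derivation step, not in the original text): with a_i fixed, only the terms Σ_j a_ij²/β_j and γ·||β||_1 of J depend on β_j, so setting the partial derivative to zero gives the stated minimal solution:

```latex
\frac{\partial J}{\partial \beta_j}
  = -\sum_{i=1}^{n}\frac{a_{ij}^{2}}{\beta_j^{2}} + \gamma = 0
\quad\Longrightarrow\quad
\beta_j = \sqrt{\frac{1}{\gamma}\sum_{i=1}^{n} a_{ij}^{2}}
```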
4.4 Let α_ij be the Lagrange multiplier for the constraint a_ij ≥ 0 with A = [a_ij]; the Lagrangian L is as follows:

L = J + Tr[αA^T] = Tr[F(V − AV)(V − AV)^T + diag(β)^(−1)·A^T·A] + γ·||β||_1 + Tr[αA^T],  α = [α_ij]

where F is the diagonal matrix of step 4.1, whose diagonal elements are f_1*, …, f_n*; diag(β) is likewise a diagonal matrix, whose diagonal elements are β_1, …, β_n;
4.5 Differentiating the Lagrangian L with respect to A gives:

∂L/∂A = −2FVV^T + 2FAVV^T + 2A·diag(β)^(−1) + α

Setting ∂L/∂A = 0, α can be expressed as follows:

α = 2FVV^T − 2FAVV^T − 2A·diag(β)^(−1)
According to the Karush-Kuhn-Tucker condition α_ij·a_ij = 0, multiplying each term of the above formula by a_ij gives the following equation:

(F V V^T)_ij·a_ij − (F A V V^T)_ij·a_ij − (A·diag(β)^(−1))_ij·a_ij = 0

From the above, the following update formula is obtained:

a_ij ← a_ij · (F V V^T)_ij / [F A V V^T + A·diag(β)^(−1)]_ij

The above update formula is iterated until convergence, finally yielding the summary sentences of the book section.
CN201410090143.6A 2014-03-12 2014-03-12 Books chapters and sections abstraction generating method based on books reading behavior Active CN103885935B (en)

Publications (2)

Publication Number Publication Date
CN103885935A true CN103885935A (en) 2014-06-25
CN103885935B CN103885935B (en) 2016-06-29




