CN101567004B

CN101567004B - English text automatic abstracting method based on eye tracking

Info

Publication number: CN101567004B
Application number: CN2009100960607A
Authority: CN
Inventors: 徐颂华; 江浩; 刘智满
Original assignee: Zhejiang University ZJU
Current assignee: Zhejiang University ZJU
Priority date: 2009-02-06
Filing date: 2009-02-06
Publication date: 2012-05-30
Anticipated expiration: 2029-02-06
Also published as: CN101567004A

Abstract

The invention relates to an English text automatic abstracting method based on eye tracking. Existing methods cannot generate personalized text abstracts for different readers. The method comprises the following steps: obtaining the user's attention time for every word in a text while the user reads an electronic text, using an eye tracking device or a camera; predicting the user's interest in every sentence based on text similarity; and generating a personalized automatic abstracting result by combining the user interest with a text automatic abstracting algorithm. The method effectively incorporates user interest into the English text automatic abstracting process, so that the final automatic abstracting result is closer to the abstract content expected by the user, enabling automatic abstracting software to provide a better personalized service.

Description

English text automatic abstracting method based on eye tracking

Technical field

The invention belongs to the fields of computer information retrieval and human-computer interfaces, and relates to a personalized English text automatic abstracting method based on eye tracking.

Background technology

A considerable number of research projects and results already address the automatic summarization of English text by computer, covering both general documents and documents from specific knowledge domains. Examples include: "Summarization in the Small" by Richard Alterman et al., published in 1986 in "Advances in Cognitive Science", and "Text Summarization", published in 1992 in the "Encyclopedia of Artificial Intelligence"; "Automatic Generation of Concise Summaries of Spoken Dialogues in Unrestricted Domains" by Klaus Zechner, presented at the SIGIR 2001 conference; "Movie Review Mining and Summarization" by L. Zuang et al., presented at the CIKM 2006 conference; and "Extractive Summarization Using Supervised and Semi-Supervised Learning" by Wong et al., presented at the Coling 2008 conference. Existing systems include the MEAD summarization system developed by Radev et al. in 2003 and the GIN system developed in 2007 by the CLAIR research group at the University of Michigan, Ann Arbor, USA. None of these methods produces personalized text summaries for different readers, so they cannot meet each reader's needs.

Summary of the invention

The objective of the invention is to overcome the deficiencies of the prior art by providing a personalized English text automatic abstracting method based on eye tracking.

The method of the invention comprises the following steps:

Step 1) Obtain the user's attention time for every word in the text while the user reads the electronic document. The specific method is:

(a) Initialize the user attention time of every word in the text to 0.

(b) Every 0.1 second, obtain the focal position (x, y) of the user's eyes on the screen through an eye tracker or a camera. Obtaining the on-screen focal position of the user's eyes with an eye tracker or a camera is an existing, mature technique.

(c) For each word wi in the text, whose position on the current screen is (xi, yi), the increment AT(wi) of its user attention time over this interval is:

AT(w_{i}) = 0.1 \exp\left(-\frac{(x_{i} - x)^{2}}{2 k_{x}^{2}} - \frac{(y_{i} - y)^{2}}{2 k_{y}^{2}}\right)

Here kx and ky are, respectively, the average width and the average height of the words in the text on the screen, and AT(wi) is measured in seconds.

(d) Repeat steps (b) and (c) until the user finishes reading the electronic document, which yields the user attention time of every word in the text.
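Steps (a) to (d) can be sketched in Python as follows. This is only an illustrative sketch: the function names, the word layout, and the gaze trace are hypothetical and are not part of the patent, and real gaze samples would come from the eye tracker or camera rather than from a fixed list.

```python
import math

SAMPLE_INTERVAL = 0.1  # seconds between gaze samples, as in step (b)


def attention_increment(word_pos, gaze_pos, kx, ky):
    """Step (c): attention-time increment (seconds) for one word and one gaze sample."""
    xi, yi = word_pos
    x, y = gaze_pos
    exponent = -((xi - x) ** 2) / (2 * kx ** 2) - ((yi - y) ** 2) / (2 * ky ** 2)
    return SAMPLE_INTERVAL * math.exp(exponent)


def accumulate_attention(word_positions, gaze_samples, kx, ky):
    """Steps (a)-(d): sum the increments of every word over all gaze samples."""
    attention = {word: 0.0 for word in word_positions}  # step (a): start at 0
    for gaze in gaze_samples:  # one sample per 0.1 s, step (b)
        for word, pos in word_positions.items():
            attention[word] += attention_increment(pos, gaze, kx, ky)
    return attention


# Hypothetical layout: two words on one line, 60 pixels apart.
word_positions = {"eye": (100.0, 50.0), "tracking": (160.0, 50.0)}
# Hypothetical gaze trace: three samples resting exactly on "eye".
gaze_samples = [(100.0, 50.0)] * 3
attention = accumulate_attention(word_positions, gaze_samples, kx=60.0, ky=20.0)
```

In this toy trace the fixated word accumulates the full 0.1 s per sample, while the neighboring word one average word-width away receives a Gaussian-discounted share of each sample.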

Step 2) Predict the user interest degree of every sentence in the text based on text similarity. The specific steps are:

(e) Calculate the semantic similarity Sim(wi, wj) between any two words wi and wj in the text; this similarity is a real number in the range [0, 1]. The calculation uses the method proposed by Y. Li et al. in "An approach for measuring semantic similarity between words using multiple information sources", published in 2003 in IEEE Transactions on Knowledge and Data Engineering.

(f) For any word w in the text, select the k words in the text with the greatest similarity to it, where k = min(10, n) and n is the number of distinct words in the text. Denoting the k selected words as w1, w2, ..., wk, predict the user interest degree of word w by formula (1):

I(w) = \frac{\sum_{i=1}^{k} AT(w_{i}) \, Sim^{\gamma}(w_{i}, w) \, \delta(w_{i}, w)}{\sum_{i=1}^{k} Sim^{\gamma}(w_{i}, w) \, \delta(w_{i}, w) + \epsilon} \qquad (1)

Here γ is a constant that controls how much weight the value of Sim() carries; ε is a positive integer constant that prevents the denominator of formula (1) from being 0; and the function δ(), which filters out words of low similarity, is defined as:

\delta(w_{i}, w) = \begin{cases} 1 & \text{if } Sim^{\gamma}(w_{i}, w) > 0.01 \\ 0 & \text{otherwise} \end{cases}

(g) The user interest degree I(s) of any sentence s in the text is the sum of the user interest degrees of all distinct words in that sentence.
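A minimal Python sketch of steps (f) and (g) is given below. It rests on several assumptions that the text does not fix: `toy_sim` is a hypothetical stand-in for the Li et al. similarity measure, γ = 1 and ε = 1 are arbitrary illustrative values, and a word is allowed to appear among its own k most similar words.

```python
def toy_sim(a, b):
    """Hypothetical similarity in [0, 1]; NOT the Li et al. measure."""
    if a == b:
        return 1.0
    return 0.5 if a[0] == b[0] else 0.0


def predict_word_interest(word, attention, sim, gamma=1.0, eps=1.0, max_neighbors=10):
    """Formula (1): similarity-weighted average of attention time."""
    vocabulary = list(attention)
    k = min(max_neighbors, len(vocabulary))  # k = min(10, n), step (f)
    # Assumption: the word itself may be one of its k most similar words.
    neighbors = sorted(vocabulary, key=lambda v: sim(v, word), reverse=True)[:k]
    numerator = 0.0
    denominator = eps  # eps keeps the denominator away from 0
    for v in neighbors:
        weight = sim(v, word) ** gamma
        if weight > 0.01:  # delta(): drop low-similarity words
            numerator += attention[v] * weight
            denominator += weight
    return numerator / denominator


def sentence_interest(sentence_words, attention, sim, **kwargs):
    """Step (g): sum the interest of the distinct words in one sentence."""
    return sum(predict_word_interest(w, attention, sim, **kwargs)
               for w in set(sentence_words))


# Hypothetical attention times (seconds) for a three-word vocabulary.
attention = {"eye": 0.3, "eyes": 0.2, "text": 0.0}
interest_eye = predict_word_interest("eye", attention, toy_sim)
interest_sentence = sentence_interest(["eye", "text", "eye"], attention, toy_sim)
```

Because the prediction is a weighted average over similar words, a word the reader never fixated can still receive a nonzero interest degree when semantically close words were read attentively.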

Step 3) Combine the user interest degree with a text summarization algorithm to generate the personalized automatic abstracting result. The specific method is:

(h) Let the summary length required by the user be c% of the text length, and use a text summarization algorithm based on semantic analysis to obtain a summary result with a compression ratio of c%. The text summarization algorithm based on semantic analysis may be any existing mature method, such as Word AutoSummarize or MEAD.

(i) For each sentence s in the text, calculate the offset I_offset(s) of its user interest degree:

I_{offset}(s) = (1 - k) \max_{i=1}^{m} \{ I(s_{i}) \} \, \lambda(s)

Here I(si) is the user interest degree of sentence si; s1, s2, ..., sm are all the sentences in the text; and m is the total number of sentences in the text. λ(s) is 1 if sentence s appears in the summary result obtained in step (h), and 0 otherwise. k is a free parameter in the range 0 to 1; it is distinct from the word count k used in step (f).

(j) Calculate the adjusted user interest degree I_adj(s) of each sentence s in the text:

I_{adj}(s) = I(s) + I_{offset}(s)

(k) Rank all sentences s in the text by adjusted user interest degree from high to low, and select the top c% of sentences as the summary result of the text.
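Steps (i) to (k) might be sketched as follows. The function name, the example interest values, and the baseline summary are hypothetical, and the rounding rule for "the top c%" (round up, at least one sentence) is an assumption, since the text does not specify one.

```python
import math


def personalized_summary(interest, baseline_summary, k=0.5, c=30.0):
    """Steps (i)-(k): indices of the top c% sentences by adjusted interest.

    interest: list of I(s) values, one per sentence.
    baseline_summary: set of sentence indices chosen in step (h).
    """
    max_interest = max(interest)
    adjusted = []
    for idx, value in enumerate(interest):
        lam = 1.0 if idx in baseline_summary else 0.0  # lambda(s)
        offset = (1.0 - k) * max_interest * lam        # step (i)
        adjusted.append(value + offset)                # step (j)
    # Assumption: round the sentence budget up and keep at least one sentence.
    n_select = max(1, math.ceil(len(interest) * c / 100.0))
    ranked = sorted(range(len(interest)), key=lambda i: adjusted[i], reverse=True)
    return sorted(ranked[:n_select])                   # step (k), in text order


# Hypothetical data: four sentences, baseline summarizer picked sentences 0 and 2.
interest = [0.1, 0.9, 0.4, 0.2]
baseline = {0, 2}
only_baseline = personalized_summary(interest, baseline, k=0.0, c=50.0)
only_interest = personalized_summary(interest, baseline, k=1.0, c=50.0)
```

With k = 0 every baseline sentence is lifted above all others and the baseline summary is reproduced; with k = 1 the offset vanishes and the ranking follows the user interest degree alone.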

The method of the invention effectively incorporates the user's preferences into the English text automatic abstracting process, so that the final summary result is closer to the summary content the user expects, enabling automatic abstracting software to provide the user with a better personalized service.

Description of drawings

Fig. 1 is a flowchart of an embodiment of the method of the invention.

Embodiment

As shown in Fig. 1, the English text automatic abstracting method based on eye tracking comprises the following modules: eye tracking device 10, user attention time sample collection 20, user interest degree prediction 30, traditional text automatic abstracting method 40, user interest degree adjustment 50, and text summarization result 60. The specific steps are as follows:

(b) Every 0.1 second, obtain the focal position (x, y) of the user's eyes on the screen through the eye tracking device. The eye tracking device is assembled from an ordinary webcam (Logitech QuickCam Notebook Pro) paired with the open-source eye tracking system opengazer.

AT(w_{i}) = 0.1 \exp\left(-\frac{(x_{i} - x)^{2}}{2 k_{x}^{2}} - \frac{(y_{i} - y)^{2}}{2 k_{y}^{2}}\right)

(d) Repeat steps (b) and (c) until the user finishes reading the electronic document, which yields the user attention time of every word in the text. The user attention time sample collection module 20 records the eye focal position obtained by the eye tracking system at each moment and accumulates the user attention time of every word in the text.

(e) Calculate the semantic similarity Sim(wi, wj) between any two words wi and wj in the text. The calculation uses the method proposed by Y. Li et al. in "An approach for measuring semantic similarity between words using multiple information sources", published in 2003 in IEEE Transactions on Knowledge and Data Engineering.

I(w) = \frac{\sum_{i=1}^{k} AT(w_{i}) \, Sim^{\gamma}(w_{i}, w) \, \delta(w_{i}, w)}{\sum_{i=1}^{k} Sim^{\gamma}(w_{i}, w) \, \delta(w_{i}, w) + \epsilon} \qquad (1)

\delta(w_{i}, w) = \begin{cases} 1 & \text{if } Sim^{\gamma}(w_{i}, w) > 0.01 \\ 0 & \text{otherwise} \end{cases}

(h) Let the summary length required by the user be c% of the text length, and use the MEAD English text automatic abstracting method to obtain a summary result with a compression ratio of c%.

I_{offset}(s) = (1 - k) \max_{i=1}^{m} \{ I(s_{i}) \} \, \lambda(s)

Here I(si) is the user interest degree of sentence si; s1, s2, ..., sm are all the sentences in the text; and m is the total number of sentences in the text. k is a user-definable parameter in the range [0, 1], with a default value of 0.5; it represents the share that the user attention time information obtained from the eye tracking device has in the automatic abstracting result. If k = 1, the summary result is determined entirely by user attention time; if k = 0, the summary result is entirely independent of user attention time, which is equivalent to using the MEAD system directly. λ(s) is 1 if sentence s appears in the summary result obtained in step (h), and 0 otherwise.

I_{adj}(s) = I(s) + I_{offset}(s)
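As a small numerical illustration of how k shifts the ranking, consider two sentences with invented values (they are not taken from the patent): s_1 is in the MEAD summary, so λ(s_1) = 1, with I(s_1) = 1.0; s_2 is not, so λ(s_2) = 0, with I(s_2) = 3.0. The maximum interest degree is therefore 3.0, and:

```latex
\begin{aligned}
k = 0:   &\quad I_{adj}(s_1) = 1.0 + (1 - 0) \cdot 3.0 \cdot 1 = 4.0,   &\quad I_{adj}(s_2) = 3.0 + 0 = 3.0 \\
k = 0.5: &\quad I_{adj}(s_1) = 1.0 + (1 - 0.5) \cdot 3.0 \cdot 1 = 2.5, &\quad I_{adj}(s_2) = 3.0 + 0 = 3.0 \\
k = 1:   &\quad I_{adj}(s_1) = 1.0 + (1 - 1) \cdot 3.0 \cdot 1 = 1.0,   &\quad I_{adj}(s_2) = 3.0 + 0 = 3.0
\end{aligned}
```

With k = 0 the MEAD sentence s_1 ranks first; with k = 0.5 or k = 1 the sentence the reader attended to, s_2, ranks first.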

The automatic abstracting results produced by this embodiment for 60 science and technology articles published in the electronic edition of "Science" were compared with the summary results produced by two systems that use traditional automatic abstracting methods, MS Word AutoSummarize and MEAD. The recall, precision, and F-rate at compression ratios of 10%, 20%, and 30% compare as follows:

It can be seen that the method of the invention improves on the existing methods at all three compression ratios.
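The text does not state how recall, precision, and F-rate were computed. One common convention, assumed here purely for illustration, compares the set of selected sentences against a reference summary and takes the F-rate as the harmonic mean of precision and recall; the function name and the example sets below are hypothetical.

```python
def summary_metrics(selected, reference):
    """Sentence-level recall, precision and F-rate (harmonic mean; an assumption)."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    precision = overlap / len(selected) if selected else 0.0
    recall = overlap / len(reference) if reference else 0.0
    if precision + recall > 0:
        f_rate = 2 * precision * recall / (precision + recall)
    else:
        f_rate = 0.0
    return recall, precision, f_rate


# Hypothetical sentence indices: system output versus a reference summary.
recall, precision, f_rate = summary_metrics({1, 2, 5}, {2, 5, 7, 9})
```

Here two of the three selected sentences are in the reference and two of the four reference sentences were found, so precision is 2/3 and recall is 1/2.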

Claims

1. An English text automatic abstracting method based on eye tracking, characterized in that the specific steps of the method are:

(a) initialize the user attention time of every word in the text to 0;

(b) every 0.1 second, obtain the focal position (x, y) of the user's eyes on the screen through an eye tracker or a camera;

AT(w_{i}) = 0.1 \exp\left(-\frac{(x_{i} - x)^{2}}{2 k_{x}^{2}} - \frac{(y_{i} - y)^{2}}{2 k_{y}^{2}}\right)

wherein kx and ky are, respectively, the average width and the average height of the words in the text on the screen;

(d) repeat steps (b) and (c) until the user finishes reading the electronic document, which yields the user attention time of every word in the text;

Step 2) predict the user interest degree of every sentence in the text based on text similarity, the specific method being:

(e) calculate the semantic similarity Sim(wi, wj) between any two words wi and wj in the text, this similarity being a real number in the range [0, 1];

(f) for any word w in the document, select the k words in the document with the greatest similarity to it, where k = min(10, n) and n is the number of distinct words in the document; denoting the k selected words as w1, w2, ..., wk, predict the user interest degree of word w by formula (1):

I(w) = \frac{\sum_{i=1}^{k} AT(w_{i}) \, Sim^{\gamma}(w_{i}, w) \, \delta(w_{i}, w)}{\sum_{i=1}^{k} Sim^{\gamma}(w_{i}, w) \, \delta(w_{i}, w) + \epsilon} \qquad (1)

wherein γ is a constant, ε is a positive integer constant, and the function δ() is defined as:

\delta(w_{i}, w) = \begin{cases} 1 & \text{if } Sim^{\gamma}(w_{i}, w) > 0.01 \\ 0 & \text{otherwise} \end{cases}

(g) the user interest degree I(s) of any sentence s in the document is the sum of the user interest degrees of all distinct words in that sentence;

(h) let the summary length required by the user be c% of the document length, and use a text summarization algorithm based on semantic analysis to obtain a summary result with a compression ratio of c%;

(i) for each sentence s in the document, calculate the offset I_offset(s) of its user interest degree:

I_{offset}(s) = (1 - k) \max_{i=1}^{m} \{ I(s_{i}) \} \, \lambda(s)

wherein I(si) is the user interest degree of sentence si; s1, s2, ..., sm are all the sentences in the document; m is the total number of sentences in the document; λ(s) is 1 if sentence s appears in the summary result obtained in step (h), and 0 if it does not; and k is a free parameter in the range 0 to 1;

(j) calculate the adjusted user interest degree I_adj(s) of each sentence s in the document:

I_{adj}(s) = I(s) + I_{offset}(s)

(k) rank all sentences s in the document by adjusted user interest degree from high to low, and select the top c% of sentences as the summary result of the document.