CN104657347A

CN104657347A - News optimized reading mobile application-oriented automatic summarization method

Info

Publication number: CN104657347A
Application number: CN201510063837.5A
Authority: CN
Inventors: 尹柳; 许欢庆; 郭永福; 陈沛
Original assignee: Beijing Zhongsou Network Technology Co ltd
Current assignee: Beijing Wyatt Network Technology Co. Ltd.
Priority date: 2015-02-06
Filing date: 2015-02-06
Publication date: 2015-05-27

Abstract

The invention relates to an automatic summarization method for optimized news reading mobile applications. The method comprises the following steps: (1) preprocessing the news web page content; (2) extracting the text summary; (3) generating the result. The summary keeps the HTML (Hypertext Markup Language) formatting of the original and retains its pictures and tables, which improves the presentation of the summary and the user's visual experience. Traditional automatic summarization loses semantics; in the method of the invention, selected sentences are expanded with their context, and runs of unselected sentences are merged and connected by an ellipsis, which compensates for that loss and improves semantic integrity and continuity. The user can set either of two options, the percentage of the original article that the summary occupies or the length of the summary, which improves flexibility. In a manual check of 100 randomly selected articles, the pass rate reached 99.8 percent.

Description

An automatic summarization method for optimized news reading mobile applications

Technical field

The present invention relates to an automatic summarization method, and specifically to an automatic summarization method for optimized news reading mobile applications.

Background art

With the rapid development of the Internet in recent years, a large amount of information reaches people every day in the form of electronic documents. People rely more and more on the Internet to obtain the information they need. Faced with the flood of information that arrives every day, they must filter a great deal of it to find what they need. To obtain useful information quickly and accurately from massive amounts of electronic information, automatic summarization of documents is becoming increasingly important.

As devices have developed from the early PC to today's smartphone, people have begun to move from browsing information only on a traditional PC to browsing on mobile phones. Given the small screen of a mobile phone, the demand for automatic summarization is even more urgent.

Automatic summarization means that a computer program automatically extracts the main ideas of a document and, by recombining and revising the extracted important information, generates a digest that is more concise and easier to understand than the original. By reading a short digest, a reader can grasp the original quickly and easily without reading the full text, which greatly improves the efficiency of obtaining information from electronic text. Current automatic summarization techniques fall into two main classes: statistics-based mechanical summarization and knowledge-based understanding summarization. Mechanical summarization uses statistical methods to obtain the keywords of a document and, combined with heuristic information such as cue words and position, selects suitable sentences from the document; the summary is obtained after polishing. Understanding summarization aims to use various kinds of knowledge and formal theory to generate a digest (a summary or condensation of the original) on the basis of understanding the semantic content of the document.

Mechanical summarization is fast and not restricted to a particular domain, but the summaries it generates are of poor quality, with problems such as incomplete coverage of the content and redundant statements. Compared with mechanical summarization, understanding summarization produces better summaries that are concise, comprehensive, accurate, and readable. However, it requires the computer not only to understand and generate natural language but also to represent and organize various kinds of background and domain knowledge. This work is extremely difficult, and little progress has been made so far. Understanding summarization is therefore rarely used and is limited to very narrow applications.

Summary of the invention

In view of the deficiencies of the prior art, the present invention proposes an automatic summarization method for optimized news reading mobile applications. Given the particular characteristics of mobile terminals, it provides a formatted automatic summary that makes the user experience more comfortable. The invention generates the summary automatically in combination with the HTML formatting, retains the pictures and tables of the original, and expands the important information with its preceding and following context, which improves the integrity and continuity of the content. This avoids a summary whose presentation is monotonous, stiff, and disjointed, and optimizes news reading on mobile terminals.

The object of the invention is achieved by the following technical solution:

An automatic summarization method for optimized news reading mobile applications, the improvement being that the method comprises:

(1) preprocessing the news web page content;

(2) extracting the text summary;

(3) generating the result.

Preferably, step (1) comprises:

(1.1) loading the dictionary and the stop words;

(1.2) dividing the news web page content into blocks according to its HTML tags, denoted k_i;

(1.3) splitting each k_i into sentences, using paragraph end marks and full stops as sentence boundaries;

(1.4) extracting the HTML tag h_i and the text s_i of each sentence;

(1.5) recording the corresponding positions of the tag h_i and the text s_i of each sentence;

(1.6) segmenting the text s_i into words;

(1.7) removing stop words and other noise; the result is denoted word_i.

Further, each word_i is the word sequence that remains after the stop words are removed.

Preferably, step (2) comprises:

(2.1) calculating the co-occurrence similarity sim_{i,j} of word_i and word_j;

(2.2) iterating according to the formula pr_i = 1 - d/m + d \cdot \sum_j sim_{j,i} \cdot pr_j / out_j;

(2.3) sorting the sentences s_i in descending order of their pr_i values to generate the sentence sequence s_k;

where word_i is the word sequence corresponding to sentence text s_i, word_j is the word sequence corresponding to sentence text s_j, sim_{i,j} is the contribution value of sentence i to sentence j, d ∈ (0,1), m is the maximum dimension of the matrix, out_j is the out-degree of sentence vertex j, the initial value of pr is 1/m, and the convergence precision is 0.001.
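Step (2.1) can be illustrated with a short Python sketch. The text does not state how the co-occurrence similarity is computed, so the measure below is an assumption: the number of words two sentences share, normalized by the logarithms of their lengths (the convention used by TextRank, with 1 added to each length to avoid a zero denominator). The names `cooccurrence_similarity` and `similarity_matrix` are hypothetical.

```python
import math
from typing import List, Sequence


def cooccurrence_similarity(word_i: Sequence[str], word_j: Sequence[str]) -> float:
    """Similarity of two word sequences based on the words they share.

    Assumption: shared-word count divided by the sum of log sentence
    lengths; the source does not give the exact formula.
    """
    shared = len(set(word_i) & set(word_j))
    if shared == 0:
        return 0.0
    norm = math.log(len(word_i) + 1) + math.log(len(word_j) + 1)
    return shared / norm


def similarity_matrix(words: List[Sequence[str]]) -> List[List[float]]:
    """Symmetric m x m matrix of pairwise similarities with a zero diagonal."""
    m = len(words)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            # The graph is undirected, so both directions get the same weight.
            sim[i][j] = sim[j][i] = cooccurrence_similarity(words[i], words[j])
    return sim
```

Because the matrix is symmetric, it can serve directly as the weight matrix of the undirected sentence graph used in step (2.2).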

Preferably, step (3) comprises:

(3.1) taking the top L sentences from s_k;

(3.2) expanding the L selected sentences with their preceding and following context to obtain the set s_l;

(3.3) re-sorting s_l by order of appearance in the original to obtain s'_l;

(3.4) inserting s'_l at the corresponding positions in combination with h_i;

(3.5) if several consecutive sentences are all unselected, i.e. not in the set s'_l, merging them;

(3.6) according to the length or percentage set by the user, checking whether the length of the result of (3.5) is within the limit and, if it exceeds the limit, truncating it to obtain the final result.
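The length check of step (3.6) might look like the following Python sketch. It assumes that lengths are measured in characters and that, if both options are set, the smaller limit applies; neither detail is stated in the text. The names `summary_budget` and `truncate` are hypothetical.

```python
from typing import Optional


def summary_budget(original_length: int,
                   percent: Optional[float] = None,
                   max_length: Optional[int] = None) -> int:
    """Return the maximum summary length in characters.

    percent    -- summary size as a percentage of the original, in (0, 100]
    max_length -- absolute summary length
    Assumption: when both are given, the stricter (smaller) limit wins.
    """
    limits = []
    if percent is not None:
        if not 0 < percent <= 100:
            raise ValueError("percent must be in (0, 100]")
        limits.append(int(original_length * percent / 100))
    if max_length is not None:
        if max_length <= 0:
            raise ValueError("max_length must be positive")
        limits.append(max_length)
    # With no option set, fall back to the full length (no truncation).
    return min(limits) if limits else original_length


def truncate(summary: str, budget: int) -> str:
    """Cut the summary down to the budget if it is too long."""
    return summary if len(summary) <= budget else summary[:budget]
```

For a 1000-character article, a setting of 20 percent gives a budget of 200 characters; a summary longer than that is cut at the budget.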

Compared with the prior art, the beneficial effects of the invention are as follows:

Compared with ordinary automatic summarization, the method adds HTML formatting and retains pictures and tables, which improves the presentation of the digest and enhances the user's visual experience.

Traditional automatic summarization loses semantics. The invention expands sentences with their context, and merges unselected sentences and connects them with an ellipsis, which compensates for the semantic loss of traditional summaries and improves semantic integrity and continuity.

The invention provides two options, the percentage of the original that the summary occupies and the length of the summary, for the user to choose and set, which improves flexibility.

In a manual check of 100 randomly selected articles, the pass rate reached 99.8%.

Brief description of the drawings

Fig. 1 is a flowchart of the automatic summarization method for optimized news reading mobile applications provided by the invention.

Fig. 2 is a structure diagram of the preprocessing module in the automatic summarization method for optimized news reading mobile applications provided by the invention.

Fig. 3 is a flowchart of the text summary extraction module in the automatic summarization method for optimized news reading mobile applications provided by the invention.

Fig. 4 is a flowchart of the result generation module in the automatic summarization method for optimized news reading mobile applications provided by the invention.

Detailed description

The specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.

The automatic summarization method of the invention for optimized news reading mobile applications comprises the following steps: preprocessing the news web page content, extracting the text summary, and generating the result.

Fig. 2 shows the structure of the preprocessing of the news web page content. Preprocessing first divides the news web page content into blocks; each paragraph of the news corresponds to one block sequence, and each block corresponds to one word sequence. The specific steps are as follows:

1. Load the dictionary and the stop words;

2. Divide the news web page content into blocks according to its HTML tags, denoted k_i (i ∈ 1, 2, 3, ..., n); if there is a table, extract the table as an independent block k_j; otherwise, each span from a start tag to its end tag forms one block k_j;

3. Split each k_i (i ≠ j) into sentences, using paragraph end marks and full stops as sentence boundaries;

4. Extract the HTML tag h_i (i ∈ 1, 2, 3, ..., m) and the text s_i (i ∈ 1, 2, 3, ..., m) of each sentence;

5. Record the corresponding positions of the tag h_i (i ∈ 1, 2, 3, ..., m) and the text s_i (i ∈ 1, 2, 3, ..., m) of each sentence;

6. Segment the text s_i (i ∈ 1, 2, 3, ..., m) into words;

7. Remove stop words and other noise; the result is denoted word_i (i ∈ 1, 2, 3, ..., m), where each word_i is the word sequence that remains after stop-word removal and denoising.
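The preprocessing steps above can be sketched in Python with the standard library only. This is a minimal illustration under several assumptions that are not in the text: the set of block tags in `BLOCK_TAGS`, a regular-expression tokenizer standing in for dictionary-based word segmentation (which Chinese text would require), and the hypothetical names `BlockSplitter`, `split_sentences`, and `preprocess`. Pictures and nested blocks of the same tag are not handled.

```python
import re
from html.parser import HTMLParser
from typing import List, Optional, Set, Tuple

BLOCK_TAGS = {"p", "h1", "h2", "h3", "table"}  # assumed block-level tags


class BlockSplitter(HTMLParser):
    """Collect (tag, text) blocks; a table is kept as one independent block."""

    def __init__(self) -> None:
        super().__init__()
        self.blocks: List[Tuple[str, str]] = []
        self._tag: Optional[str] = None      # tag of the block being collected
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self._tag is None and tag in BLOCK_TAGS:
            self._tag, self._text = tag, []

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == self._tag:                 # the open block ends here
            self.blocks.append((tag, "".join(self._text).strip()))
            self._tag = None


def split_sentences(text: str) -> List[str]:
    """Split on full stops and similar end marks (Chinese and Latin)."""
    parts = re.findall(r"[^。.!?！？]+[。.!?！？]?", text)
    return [p.strip() for p in parts if p.strip()]


def preprocess(html: str, stop_words: Set[str]):
    """Return parallel lists: tags h_i, sentences s_i, word sequences word_i."""
    splitter = BlockSplitter()
    splitter.feed(html)
    splitter.close()
    tags, sentences, words = [], [], []
    for tag, text in splitter.blocks:
        if tag == "table":
            continue                         # tables are not split into sentences
        for s in split_sentences(text):
            tags.append(tag)                 # shared index i records the position
            sentences.append(s)
            tokens = re.findall(r"\w+", s.lower())   # stand-in for segmentation
            words.append([w for w in tokens if w not in stop_words])
    return tags, sentences, words
```

Keeping the three lists index-aligned is one simple way to record which tag each sentence came from, so that the selected sentences can later be placed back into their original markup.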

Fig. 3 is the flowchart of the text summary extraction module. The specific steps are as follows:

1. Calculate the co-occurrence similarity sim_{i,j} of word_i and word_j;

Calculate the similarity sim_{i,j} of word_i (the word sequence corresponding to sentence text s_i) and word_j (the word sequence corresponding to sentence text s_j), where sim_{i,j} is the contribution value of sentence i to sentence j;

Generate an undirected graph matrix from sim_{i,j};

2. Iterate according to the formula pr_i = 1 - d/m + d \cdot \sum_j sim_{j,i} \cdot pr_j / out_j;

In the iteration, d ∈ (0,1), m is the maximum dimension of the matrix, out_j is the out-degree of sentence vertex j (i.e. sentence j), and the convergence precision is 0.001;

3. Sort the sentences s_i in descending order of their pr_i values to generate the sentence sequence s_k;

Note: the formula in (2.2) comes from the PageRank algorithm (Brin and Page, 1998).
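The iteration and the descending sort can be sketched in Python as follows. Three points are assumptions rather than statements of the text: the flattened term "1-d/m" is read as (1 - d)/m, as in the normalized form of PageRank; out_j is taken as the sum of the weights on the edges of vertex j; and d defaults to 0.85, the customary PageRank damping factor (the text only requires 0 < d < 1). The names `rank_sentences` and `descending_order` are hypothetical.

```python
from typing import List


def rank_sentences(sim: List[List[float]], d: float = 0.85,
                   tol: float = 0.001, max_iter: int = 100) -> List[float]:
    """Iterate pr_i = (1 - d)/m + d * sum_j sim[j][i] * pr_j / out_j.

    Assumptions: "1-d/m" is read as (1 - d)/m, and out_j is the weighted
    out-degree of vertex j (the sum of row j of the similarity matrix).
    """
    m = len(sim)
    if m == 0:
        return []
    pr = [1.0 / m] * m                       # initial value 1/m
    out = [sum(row) for row in sim]          # weighted out-degree of each vertex
    for _ in range(max_iter):
        new = []
        for i in range(m):
            # Vertices without outgoing weight contribute nothing.
            incoming = sum(sim[j][i] * pr[j] / out[j]
                           for j in range(m) if out[j] > 0)
            new.append((1 - d) / m + d * incoming)
        delta = max(abs(a - b) for a, b in zip(new, pr))
        pr = new
        if delta < tol:                      # convergence precision 0.001
            break
    return pr


def descending_order(pr: List[float]) -> List[int]:
    """Sentence indices sorted by pr value, highest first (the sequence s_k)."""
    return sorted(range(len(pr)), key=lambda i: pr[i], reverse=True)
```

In a graph where one sentence shares words with every other sentence and the others share words only with it, that central sentence receives the highest pr value and comes first in the sorted sequence.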

Fig. 4 is the flowchart of the result generation module. The specific steps of the result generation module are as follows:

1. Take the top L sentences from s_k, where L ∈ (1, m);

2. Expand the L selected sentences with their preceding and following context to obtain the set s_l;

3. Re-sort s_l by order of appearance in the original to obtain s'_l;

4. Insert s'_l at the corresponding positions in combination with h_i (i ∈ 1, 2, 3, ..., m) and the position information;

5. If several consecutive sentences are all unselected, i.e. not in the set s'_l, merge them and connect with '...';

6. According to the length or percentage set by the user, check whether the length of the result of (3.5) is within the limit and, if it exceeds the limit, truncate it to obtain the final result.
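The result generation steps can be sketched in Python as plain text assembly. The sketch makes assumptions that the text leaves open: context expansion adds a fixed number of neighbouring sentences (`window`) on each side, every run of unselected sentences (including a run of one) is replaced by a single '...', and the length limit is counted in characters. Re-insertion into the HTML tags (step 4) is omitted. The name `build_summary` and its parameters are hypothetical.

```python
from typing import List


def build_summary(sentences: List[str], ranked: List[int], top_l: int,
                  window: int = 1, max_chars: int = 0) -> str:
    """Assemble the summary text from ranked sentence indices.

    sentences -- s_i in original order
    ranked    -- sentence indices sorted by descending pr value (s_k)
    window    -- assumed number of neighbours added on each side
    max_chars -- user length limit in characters; 0 means no limit
    """
    selected = set()
    for idx in ranked[:top_l]:                       # step 1: top L sentences
        lo = max(0, idx - window)
        hi = min(len(sentences) - 1, idx + window)
        selected.update(range(lo, hi + 1))           # step 2: context expansion
    parts: List[str] = []
    gap_open = False
    for i, sentence in enumerate(sentences):         # step 3: original order
        if i in selected:
            parts.append(sentence)
            gap_open = False
        elif not gap_open:                           # step 5: one '...' per run
            parts.append("...")
            gap_open = True
    summary = " ".join(parts)
    if max_chars and len(summary) > max_chars:       # step 6: truncate
        summary = summary[:max_chars]
    return summary
```

Walking the sentences in original order makes the re-sorting of step 3 implicit: membership in the selected set decides whether a sentence is emitted or folded into the ellipsis.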

Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Those of ordinary skill in the art may still modify the specific embodiments of the invention or make equivalent replacements with reference to the above embodiments; any such modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the invention.

Claims

1. An automatic summarization method for optimized news reading mobile applications, characterized in that the method comprises:

(1) preprocessing the news web page content;

(2) extracting the text summary;

(3) generating the result.

2. The automatic summarization method for optimized news reading mobile applications according to claim 1, characterized in that step (1) comprises:

(1.1) loading the dictionary and the stop words;

(1.4) extracting the HTML tag h_i and the text s_i of each sentence;

(1.5) recording the corresponding positions of the tag h_i and the text s_i of each sentence;

(1.6) segmenting the text s_i into words;

(1.7) removing stop words and other noise; the result is denoted word_i.

3. The automatic summarization method for optimized news reading mobile applications according to claim 2, characterized in that each word_i is the word sequence that remains after the stop words are removed.

4. The automatic summarization method for optimized news reading mobile applications according to claim 1, characterized in that step (2) comprises:

(2.1) calculating the co-occurrence similarity sim_{i,j} of word_i and word_j;

(2.2) iterating according to the formula

pr_i = 1 - d/m + d \cdot \sum_j sim_{j,i} \cdot pr_j / out_j

5. The automatic summarization method for optimized news reading mobile applications according to claim 1, characterized in that step (3) comprises:

(3.1) taking the top L sentences from s_k;

(3.3) re-sorting s_l by order of appearance in the original to obtain s'_l;

(3.4) inserting s'_l at the corresponding positions in combination with h_i;

(3.5) if several consecutive sentences are all unselected, i.e. not in the set s'_l, merging them;