CN104657347A - News optimized reading mobile application-oriented automatic summarization method - Google Patents
News optimized reading mobile application-oriented automatic summarization method
- Publication number: CN104657347A
- Application number: CN201510063837.5A
- Authority: CN (China)
- Prior art keywords: sentence, word, text, news, auto
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention relates to an automatic summarization method for mobile applications that optimize news reading. The method comprises the following steps: (1) preprocessing the news web page content; (2) extracting a text summary; (3) generating the result. The summary is produced in HTML (Hypertext Markup Language) format and retains pictures and tables, which improves how the summary is displayed and enhances the user's visual experience. Traditional automatic summarization loses semantics; the method provided by the invention expands each selected sentence with its context and merges runs of unselected sentences, joining them with an ellipsis, which compensates for that loss and improves semantic integrity and continuity. The user can set either of two options, the percentage of the original article that the summary occupies or the length of the summary, which improves flexibility. In a manual check of 100 randomly selected articles, the pass rate reached 99.8 percent.
Description
Technical field
The present invention relates to an automatic summarization method, and specifically to an automatic summarization method for mobile applications that optimize news reading.
Background art
With the rapid development of the Internet in recent years, a large amount of information appears every day in the form of electronic documents. People rely more and more on the Internet to obtain the information they need, and faced with this daily flood they must filter a great deal of material to find it. To obtain useful information quickly and accurately from massive electronic sources, automatic summarization of documents is becoming increasingly important.
As devices have evolved from the early PC to today's smartphones, people have moved from browsing information only on a traditional PC to browsing on mobile handsets. The small screen of a phone makes the need for automatic summarization even more pressing.
Automatic summarization means that a computer program automatically extracts the main ideas of a document and, by recombining and revising the extracted important information, generates a digest that is more concise and easier to understand than the original. Reading a short digest is enough to grasp the original quickly and easily without reading the full text, which greatly improves the efficiency with which people obtain information from electronic text. Current automatic summarization techniques fall into two main classes: statistics-based mechanical summarization and knowledge-based understanding summarization. Mechanical summarization uses statistical methods to obtain the keywords of a document and, combined with heuristic information such as cue words and position, selects suitable sentences from the document; the summary is obtained after polishing. Understanding summarization aims to use various kinds of knowledge and formal theory to generate a digest (a summary or condensation of the original) on the basis of understanding the document's semantic content.
Mechanical summarization is fast and not restricted to any domain, but the summaries it generates are of lower quality: they cover the content incompletely and contain redundant statements. Understanding summarization produces better summaries that are concise, comprehensive, accurate, and readable. However, it requires the computer not only to understand and generate natural language but also to represent and organize various kinds of background and domain knowledge. This work is extremely difficult and progress so far has been very limited, so understanding summarization is rarely used and is confined to very narrow applications.
Summary of the invention
To address the deficiencies of the prior art, the present invention proposes an automatic summarization method for mobile applications that optimize news reading. Given the particular constraints of mobile terminals, it designs a formatted automatic summary that makes the user experience more comfortable. The invention generates the summary automatically in combination with HTML styling, retains the pictures and tables of the original, and expands important information with its preceding and following context, improving the integrity and continuity of the content. This avoids summaries whose style is monotonous, stiff, and disjointed, and optimizes news reading on mobile terminals.
The object of the invention is achieved by the following technical solution:
An automatic summarization method for mobile applications that optimize news reading, wherein the method comprises
(1) preprocessing the news web page content;
(2) extracting a text summary;
(3) generating the result.
Preferably, step (1) comprises
(1.1) loading the dictionary and the stop words;
(1.2) dividing the news web page content into blocks according to its HTML tags, denoted $k_i$;
(1.3) splitting each $k_i$ into sentences, using paragraph end marks and full stops as sentence boundaries;
(1.4) extracting the HTML tag $h_i$ and the text $s_i$ of each sentence;
(1.5) recording the correspondence between the positions of $h_i$ and the text $s_i$ of each sentence;
(1.6) segmenting the text $s_i$ into words;
(1.7) removing stop words and other noise, the result being denoted $word_i$.
Further, each $word_i$ is the word sequence that remains after the stop words are removed.
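Steps (1.3), (1.6), and (1.7) can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the stop-word set `STOP_WORDS` and the function names are hypothetical, and the regex tokenizer is only a stand-in for a real word segmenter (Chinese text in particular needs a dictionary-based segmenter, which is what step (1.1) loads).

```python
import re

# Hypothetical stop-word list; the method loads one in step (1.1).
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}

def split_sentences(block_text):
    """Split a block into sentences at full stops (Chinese or Western)
    and at paragraph breaks, keeping the full stop with its sentence."""
    parts = re.findall(r"[^。.\n]+[。.]?", block_text)
    return [p.strip() for p in parts if p.strip()]

def to_word_sequence(sentence):
    """Tokenize on runs of word characters (assumed stand-in for a real
    segmenter) and drop stop words."""
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

if __name__ == "__main__":
    for s in split_sentences("News broke today. Markets fell.\nMore later"):
        print(s, "->", to_word_sequence(s))
```

Note that splitting on every full stop also splits decimal numbers such as "3.5"; a production splitter would need extra rules for such cases.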
Preferably, step (2) comprises
(2.1) calculating the co-occurrence similarity $sim_{i,j}$ of $word_i$ and $word_j$;
(2.2) iterating according to the formula
$$pr_i = \frac{1-d}{m} + d \sum_{j} \frac{sim_{j,i}\, pr_j}{out_j};$$
(2.3) sorting the sentences $s_i$ in descending order of their $pr_i$ values to generate the sentence sequence $s_k$;
wherein $word_i$ is the word sequence corresponding to the sentence text $s_i$, $word_j$ is the word sequence corresponding to the sentence text $s_j$, $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$, $d \in (0,1)$, $m$ is the maximum dimension of the matrix, $out_j$ is the out-degree of sentence vertex $j$, the initial value of $pr$ is $1/m$, and the convergence precision is 0.001.
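The text does not give a formula for the co-occurrence similarity in step (2.1). As one possible reading, the sketch below uses the Jaccard overlap of the two word sets, which is an assumption chosen for illustration; the function names are likewise hypothetical. It builds the symmetric matrix that the iteration in step (2.2) consumes.

```python
def cooccurrence_similarity(words_i, words_j):
    """Share of distinct words that two sentences have in common
    (Jaccard overlap; an assumed stand-in for the patent's measure)."""
    set_i, set_j = set(words_i), set(words_j)
    if not set_i or not set_j:
        return 0.0
    return len(set_i & set_j) / len(set_i | set_j)

def similarity_matrix(word_seqs):
    """Build the symmetric m x m matrix sim[i][j]; the diagonal stays 0
    so that a sentence does not vote for itself."""
    m = len(word_seqs)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            sim[i][j] = sim[j][i] = cooccurrence_similarity(word_seqs[i], word_seqs[j])
    return sim

if __name__ == "__main__":
    seqs = [["oil", "price", "rising"], ["oil", "price", "falling"], ["weather", "sunny"]]
    for row in similarity_matrix(seqs):
        print(row)
```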
Preferably, step (3) comprises
(3.1) taking the first L sentences from $s_k$;
(3.2) expanding the L selected sentences with their preceding and following context to obtain the set $s_l$;
(3.3) re-sorting $s_l$ into the order of the original text to obtain $s'_l$;
(3.4) using $h_i$, inserting $s'_l$ at the corresponding positions;
(3.5) if several consecutive sentences are all unselected, that is, not in the set $s'_l$, merging them;
(3.6) according to the length or percentage set by the user, judging whether the length of the result of (3.5) meets the limit and, if it exceeds the limit, truncating the text to obtain the final result.
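Step (3.6) can be illustrated with a short sketch. It assumes that length is measured in characters and that truncation simply cuts at the limit; the function and parameter names are hypothetical, and the text does not say how the two user options interact, so here the percentage takes precedence when both are given.

```python
def length_budget(original_length, percent=None, max_length=None):
    """Return the allowed summary length in characters.
    percent: share of the original article (0-100); max_length: absolute cap."""
    if percent is not None:
        return int(original_length * percent / 100)
    if max_length is not None:
        return max_length
    return original_length  # no limit set: allow the full length

def truncate_to_budget(summary, budget):
    """Cut the summary only if it exceeds the budget."""
    return summary if len(summary) <= budget else summary[:budget]

if __name__ == "__main__":
    article = "x" * 200
    budget = length_budget(len(article), percent=25)
    print(budget, len(truncate_to_budget(article, budget)))
```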
Compared with the prior art, the beneficial effects of the invention are:
Compared with ordinary automatic summarization, it adds HTML formatting and retains pictures and tables, which improves how the digest is presented and enhances the user's visual experience.
Traditional automatic summaries lose semantics; the invention expands sentences with their context, merges the unselected sentences, and joins them with an ellipsis, which compensates for the semantic loss of traditional summaries and improves semantic integrity and continuity.
The invention provides two options, the percentage of the original text that the summary occupies and the summary length, for the user to choose and set, which improves flexibility.
In a manual check of 100 randomly sampled articles, the pass rate reached 99.8%.
Brief description of the drawings
Fig. 1 is a flowchart of the automatic summarization method for mobile applications that optimize news reading provided by the invention.
Fig. 2 is a structural diagram of the preprocessing module in the automatic summarization method provided by the invention.
Fig. 3 is a flowchart of the text summary extraction module in the automatic summarization method provided by the invention.
Fig. 4 is a flowchart of the result generation module in the automatic summarization method provided by the invention.
Detailed description
The specific embodiments of the invention are described in further detail below with reference to the accompanying drawings.
The automatic summarization method of the invention for mobile applications that optimize news reading comprises the following steps: preprocessing the news web page content, extracting the text summary, and generating the result.
Fig. 2 shows the structure of the preprocessing of the news web page content. Preprocessing first divides the news web page content into blocks; each paragraph of the news item corresponds to one block sequence, and each block corresponds to one word sequence. The specific steps are as follows:
1. Load the dictionary and the stop words;
2. Divide the news web page content into blocks according to its HTML tags, denoted $k_i$ ($i \in \{1, 2, 3, \ldots, n\}$); if there is a table, extract the table as an independent block $k_j$; otherwise, each span from a start tag to its end tag forms one block;
3. Split each $k_i$ ($i \neq j$) into sentences, using paragraph end marks and full stops as sentence boundaries;
4. Extract the HTML tag $h_i$ ($i \in \{1, 2, 3, \ldots, m\}$) and the text $s_i$ ($i \in \{1, 2, 3, \ldots, m\}$) of each sentence;
5. Record the correspondence between the positions of $h_i$ and the text $s_i$ of each sentence ($i \in \{1, 2, 3, \ldots, m\}$);
6. Segment the text $s_i$ ($i \in \{1, 2, 3, \ldots, m\}$) into words;
7. Remove stop words and other noise, the result being denoted $word_i$ ($i \in \{1, 2, 3, \ldots, m\}$); each $word_i$ is the word sequence that remains after stop-word removal and denoising.
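Step 2 above, including the rule that a table is kept as an independent block, can be sketched with Python's standard `html.parser` module. This is an illustrative sketch under stated assumptions: the set `BLOCK_TAGS` and the names `BlockSplitter` and `split_blocks` are hypothetical, and real news pages would need handling for nested block tags, images, and page chrome that is omitted here.

```python
from html.parser import HTMLParser

# Hypothetical choice of tags that open a text block.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "li"}

class BlockSplitter(HTMLParser):
    """Collect (tag, text) blocks; a <table> is kept whole as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished (tag, text) pairs, in page order
        self._tag = None        # tag of the block currently being collected
        self._buf = []          # text fragments of the current block
        self._table_depth = 0   # > 0 while inside a (possibly nested) table

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._table_depth == 0:
                self._tag, self._buf = "table", []
            self._table_depth += 1
        elif self._table_depth == 0 and tag in BLOCK_TAGS:
            self._tag, self._buf = tag, []

    def handle_data(self, data):
        if self._tag is not None and data.strip():
            self._buf.append(data.strip())

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
            if self._table_depth > 0:
                return          # still inside an outer table
        elif self._table_depth > 0 or tag != self._tag:
            return              # inner table markup or an unrelated end tag
        self.blocks.append((self._tag, " ".join(self._buf)))
        self._tag, self._buf = None, []

def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks

if __name__ == "__main__":
    page = "<p>First para.</p><table><tr><td>a</td><td>b</td></tr></table><p>Second.</p>"
    print(split_blocks(page))
```

Keeping the tag alongside the text mirrors the role of $h_i$: the tag can later be used to put each retained sentence back into its original markup.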
Fig. 3 is the flowchart of the text summary extraction module. The specific steps are as follows:
1. Calculate the co-occurrence similarity $sim_{i,j}$ of $word_i$ (the word sequence corresponding to the sentence text $s_i$) and $word_j$ (the word sequence corresponding to the sentence text $s_j$), where $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$; an undirected graph matrix is generated from the $sim_{i,j}$ values;
2. Iterate according to the formula
$$pr_i = \frac{1-d}{m} + d \sum_{j} \frac{sim_{j,i}\, pr_j}{out_j},$$
where $d \in (0,1)$, $m$ is the maximum dimension of the matrix, $out_j$ is the out-degree of sentence vertex $j$ (that is, sentence $j$), the initial value of $pr$ is $1/m$, and the convergence precision is 0.001;
3. Sort the sentences $s_i$ in descending order of their $pr_i$ values to generate the sentence sequence $s_k$.
Note: the formula in step 2 (step (2.2)) comes from the PageRank algorithm (Brin and Page, 1998).
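The iteration and the descending sort can be sketched as follows. The initial value $1/m$ and the convergence precision 0.001 come from the text; the damping value `d=0.85`, the iteration cap `max_iter`, and the reading of $out_j$ as the sum of the edge weights leaving vertex $j$ (as in weighted TextRank-style ranking) are assumptions, as is the function name.

```python
def rank_sentences(sim, d=0.85, tol=0.001, max_iter=100):
    """Iterate pr_i = (1 - d)/m + d * sum_j sim[j][i] * pr[j] / out[j]
    and return (scores, sentence indices sorted by descending score)."""
    m = len(sim)
    # Assumed reading of out-degree: total weight of edges leaving vertex j.
    out = [sum(row) for row in sim]
    pr = [1.0 / m] * m  # initial value 1/m, as stated in the text
    for _ in range(max_iter):
        new_pr = []
        for i in range(m):
            incoming = sum(sim[j][i] * pr[j] / out[j] for j in range(m) if out[j] > 0)
            new_pr.append((1 - d) / m + d * incoming)
        converged = max(abs(a - b) for a, b in zip(new_pr, pr)) < tol
        pr = new_pr
        if converged:
            break
    order = sorted(range(m), key=lambda i: pr[i], reverse=True)
    return pr, order

if __name__ == "__main__":
    # Sentence 0 shares words with sentences 1 and 2, which share none with each other.
    sim = [[0.0, 0.5, 0.5],
           [0.5, 0.0, 0.0],
           [0.5, 0.0, 0.0]]
    scores, order = rank_sentences(sim)
    print(scores, order)
```

In this small example the sentence that overlaps with both of the others ends up with the highest score, which is the behavior the ranking step relies on.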
Fig. 4 is the flowchart of the result generation module. The specific steps of the result generation module are as follows:
1. Take the first L sentences from $s_k$, where $L \in (1, m)$;
2. Expand the L selected sentences with their preceding and following context to obtain the set $s_l$;
3. Re-sort $s_l$ into the order of the original text to obtain $s'_l$;
4. Using $h_i$ ($i \in \{1, 2, 3, \ldots, m\}$) and the position information, insert $s'_l$ at the corresponding positions;
5. If several consecutive sentences are all unselected, that is, not in the set $s'_l$, merge them and join with '...';
6. According to the length or percentage set by the user, judge whether the length of the result of step 5 meets the limit and, if it exceeds the limit, truncate the text to obtain the final result.
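Steps 1 to 5 can be sketched as below. The function name, the `window` parameter (how many neighbouring sentences the context expansion adds on each side), and the choice to replace even a single unselected sentence with the ellipsis are assumptions made for illustration; reinserting the HTML tags $h_i$ and the length check of step 6 are left out.

```python
def build_summary(sentences, ranked, top_l, window=1, gap="..."):
    """sentences: sentence texts in original order.
    ranked: sentence indices sorted by descending score.
    Returns the summary as a list of sentences and gap markers."""
    chosen = set()
    for idx in ranked[:top_l]:
        # Context expansion: also keep up to `window` neighbours on each side.
        lo = max(0, idx - window)
        hi = min(len(sentences) - 1, idx + window)
        chosen.update(range(lo, hi + 1))
    summary = []
    in_gap = False
    # Walking in original order restores the reading order of the article.
    for i, text in enumerate(sentences):
        if i in chosen:
            summary.append(text)
            in_gap = False
        elif not in_gap:
            summary.append(gap)  # one marker per run of unselected sentences
            in_gap = True
    return summary

if __name__ == "__main__":
    sents = ["s0", "s1", "s2", "s3", "s4", "s5", "s6"]
    print(build_summary(sents, ranked=[3, 0, 6, 1, 2, 4, 5], top_l=1))
```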
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. A person of ordinary skill in the art may still modify the specific embodiments of the invention or substitute equivalents with reference to the above embodiments; any such modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the invention.
Claims (5)
1. An automatic summarization method for mobile applications that optimize news reading, characterized in that the method comprises
(1) preprocessing the news web page content;
(2) extracting a text summary;
(3) generating the result.
2. The automatic summarization method for mobile applications that optimize news reading as claimed in claim 1, characterized in that step (1) comprises
(1.1) loading the dictionary and the stop words;
(1.2) dividing the news web page content into blocks according to its HTML tags, denoted $k_i$;
(1.3) splitting each $k_i$ into sentences, using paragraph end marks and full stops as sentence boundaries;
(1.4) extracting the HTML tag $h_i$ and the text $s_i$ of each sentence;
(1.5) recording the correspondence between the positions of $h_i$ and the text $s_i$ of each sentence;
(1.6) segmenting the text $s_i$ into words;
(1.7) removing stop words and other noise, the result being denoted $word_i$.
3. The automatic summarization method for mobile applications that optimize news reading as claimed in claim 2, characterized in that each $word_i$ is the word sequence that remains after the stop words are removed.
4. The automatic summarization method for mobile applications that optimize news reading as claimed in claim 1, characterized in that step (2) comprises
(2.1) calculating the co-occurrence similarity $sim_{i,j}$ of $word_i$ and $word_j$;
(2.2) iterating according to the formula
$$pr_i = \frac{1-d}{m} + d \sum_{j} \frac{sim_{j,i}\, pr_j}{out_j};$$
(2.3) sorting the sentences $s_i$ in descending order of their $pr_i$ values to generate the sentence sequence $s_k$;
wherein $word_i$ is the word sequence corresponding to the sentence text $s_i$, $word_j$ is the word sequence corresponding to the sentence text $s_j$, $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$, $d \in (0,1)$, $m$ is the maximum dimension of the matrix, $out_j$ is the out-degree of sentence vertex $j$, the initial value of $pr$ is $1/m$, and the convergence precision is 0.001.
5. The automatic summarization method for mobile applications that optimize news reading as claimed in claim 1, characterized in that step (3) comprises
(3.1) taking the first L sentences from $s_k$;
(3.2) expanding the L selected sentences with their preceding and following context to obtain the set $s_l$;
(3.3) re-sorting $s_l$ into the order of the original text to obtain $s'_l$;
(3.4) using $h_i$, inserting $s'_l$ at the corresponding positions;
(3.5) if several consecutive sentences are all unselected, that is, not in the set $s'_l$, merging them;
(3.6) according to the length or percentage set by the user, judging whether the length of the result of (3.5) meets the limit and, if it exceeds the limit, truncating the text to obtain the final result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510063837.5A CN104657347A (en) | 2015-02-06 | 2015-02-06 | News optimized reading mobile application-oriented automatic summarization method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510063837.5A CN104657347A (en) | 2015-02-06 | 2015-02-06 | News optimized reading mobile application-oriented automatic summarization method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104657347A true CN104657347A (en) | 2015-05-27 |
Family
ID=53248496
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510063837.5A Pending CN104657347A (en) | 2015-02-06 | 2015-02-06 | News optimized reading mobile application-oriented automatic summarization method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104657347A (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020194230A1 (en) * | 2001-06-19 | 2002-12-19 | Fuji Xerox Co., Ltd. | System and method for generating analytic summaries |
CN1536483A (en) * | 2003-04-04 | 2004-10-13 | 陈文中 | Method for extracting and processing network information and its system |
CN101233510A (en) * | 2005-07-26 | 2008-07-30 | 泰普有限公司 | Processing and sending search results over a wireless network to a mobile device |
CN102298638A (en) * | 2011-08-31 | 2011-12-28 | 北京中搜网络技术股份有限公司 | Method and system for extracting news webpage contents by clustering webpage labels |
WO2014035334A1 (en) * | 2012-08-30 | 2014-03-06 | Nuffnangx Pte Ltd | Semiotic selection method and system for text summarization |
Non-Patent Citations (2)
Title |
---|
张东晋: "基于单事件新闻多文档聚类及自动文摘的设计与实现", 《中国优秀硕士论文全文数据 信息科技辑》 * |
黄文蓓: "基于网页分割和摘要的小屏幕设备网页自适应技术研究与实现", 《中国优秀硕士论文全文数据库 信息科技辑》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105335529A (en) * | 2015-12-10 | 2016-02-17 | 天津海量信息技术有限公司 | Consistent multi-type data preprocessing method |
WO2017147785A1 (en) * | 2016-03-01 | 2017-09-08 | Microsoft Technology Licensing, Llc | Automated commentary for online content |
US11922300B2 (en) | 2016-03-01 | 2024-03-05 | Microsoft Technology Licensing, Llc. | Automated commentary for online content |
CN110162765A (en) * | 2018-02-11 | 2019-08-23 | 鼎复数据科技(北京)有限公司 | A kind of machine aid reading auditing method and system based on abstract mode |
CN110110238A (en) * | 2019-03-14 | 2019-08-09 | 厦门天锐科技股份有限公司 | A kind of sensitive information methods of exhibiting and device |
CN110008313A (en) * | 2019-04-11 | 2019-07-12 | 重庆华龙网海数科技有限公司 | A kind of unsupervised text snippet method of extraction-type |
US11514242B2 (en) | 2019-08-10 | 2022-11-29 | Chongqing Sizai Information Technology Co., Ltd. | Method for automatically summarizing internet web page and text information |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C41 | Transfer of patent application or patent right or utility model | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20160412
Address after: 100086, No. 2, building 43, No. 5 West Third Ring Road, Haidian District, Beijing, 01-03A
Applicant after: Beijing Wyatt Network Technology Co. Ltd.
Address before: Shou Heng Technology Building No. 51 Beijing 100191 Haidian District Xueyuan Road room 0902
Applicant before: Beijing Zhongsou Network Technology Co,Ltd
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20150527 |