CN104657347A - News optimized reading mobile application-oriented automatic summarization method - Google Patents

News optimized reading mobile application-oriented automatic summarization method

Info

Publication number
CN104657347A
CN104657347A CN201510063837.5A CN201510063837A CN104657347A CN 104657347 A CN104657347 A CN 104657347A CN 201510063837 A CN201510063837 A CN 201510063837A CN 104657347 A CN104657347 A CN 104657347A
Authority
CN
China
Prior art keywords
sentence
word
text
news
auto
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510063837.5A
Other languages
Chinese (zh)
Inventor
尹柳
许欢庆
郭永福
陈沛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wyatt Network Technology Co. Ltd.
Original Assignee
Beijing Zhongsou Network Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongsou Network Technology Co ltd
Priority to CN201510063837.5A
Publication of CN104657347A

Abstract

The invention relates to an automatic summarization method for news-optimized reading mobile applications. The method comprises the following steps: (1) preprocessing the news web page content; (2) extracting a text summary; (3) generating the result. HTML (Hypertext Markup Language) formatting is added and pictures and tables are retained, which optimizes the presentation of the text summary and enhances the user's visual experience. Traditional automatic summarization loses semantics; in the method of the invention, sentences are expanded with their context and unselected sentences are merged and joined with ellipses, which compensates for that semantic loss and improves semantic integrity and continuity. Two options, the percentage of the original article that the summary occupies and the length of the summary, can be set by the user, which improves flexibility. 100 articles were randomly selected, and manual checking showed that the pass rate reached 99.8 percent.

Description

An automatic summarization method for news-optimized reading mobile applications
Technical field
The present invention relates to an automatic summarization method, and specifically to an automatic summarization method for news-optimized reading mobile applications.
Background art
With the rapid development of the Internet in recent years, a large amount of information reaches people every day in the form of electronic documents. People rely more and more on the Internet to obtain the information they need. Faced with this daily flood of information, they must filter a great deal of it to find what they need. To obtain useful information quickly and accurately from massive amounts of electronic information, automatic summarization of documents is becoming increasingly important.
With the shift from the early PC to today's smartphone, people have moved from browsing information solely on a traditional PC to browsing on mobile terminals. Given the small screen of a mobile phone, the demand for automatic summarization is even more pressing.
Automatic summarization means that a computer program automatically extracts the main ideas of a document and, after recombining and revising the extracted important information, generates a digest that is more concise and easier to understand than the original. By reading a short digest, a reader can grasp the original quickly and easily without reading the full text, which greatly improves the efficiency of obtaining electronic text information. Current automatic summarization techniques fall into two main classes: statistics-based mechanical summarization and knowledge-based understanding summarization. Mechanical summarization uses statistical methods to obtain the keywords of a document and, combined with heuristic information such as cue words and position, selects suitable sentences from the document; the summary is obtained after polishing. Understanding summarization aims to use various kinds of knowledge and formal theories to generate a digest (a summary or condensation of the original) on the basis of understanding the document's semantic content.
Mechanical summarization is fast and not restricted to a domain, but the summaries it generates are of relatively poor quality, with problems such as incomplete coverage of the content and redundant sentences. Compared with mechanical summarization, understanding summarization produces better summaries that are concise, comprehensive, accurate and readable. However, it requires the computer not only to understand and generate natural language but also to represent and organize various kinds of background and domain knowledge. This work is extremely difficult, and progress so far has been very limited. Understanding summarization is therefore rarely used and is confined to very narrow applications.
Summary of the invention
In view of the deficiencies of the prior art, the present invention proposes an automatic summarization method for news-optimized reading mobile applications. Considering the particular characteristics of mobile terminals, it designs a formatted automatic summary that improves the comfort of the user experience. The invention generates the summary automatically in combination with HTML styles, retains the pictures and tables of the original, and expands the context before and after important information, which improves the integrity and continuity of the content. This avoids summaries that are monotonous, stiff or disjointed in style and optimizes news reading on mobile terminals.
The object of the invention is achieved by the following technical solution:
An automatic summarization method for news-optimized reading mobile applications, the improvement being that the method comprises:
(1) preprocessing the news web page content;
(2) extracting the text summary;
(3) generating the result.
Preferably, step (1) comprises:
(1.1) loading a dictionary and a stop-word list;
(1.2) dividing the news web page content into blocks according to HTML tags, denoted $k_i$;
(1.3) splitting each $k_i$ into sentences, using paragraph end marks and full stops as sentence boundaries;
(1.4) extracting the HTML tag $h_i$ and the text $s_i$ of each sentence;
(1.5) recording the corresponding positions of the tag $h_i$ and the text $s_i$ of each sentence;
(1.6) segmenting the text $s_i$ into words;
(1.7) removing stop words and other noise, the result being denoted $word_i$.
Further, each $word_i$ is the word sequence remaining after the stop words are removed.
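The sentence splitting, word segmentation and stop-word removal of steps (1.3), (1.6) and (1.7) can be illustrated with a minimal Python sketch. The stop-word set, the function names and the regular-expression tokenizer are assumptions made for illustration; the patent loads a dictionary and a stop-word list whose contents it does not give, and Chinese text would need a dictionary-based word segmenter in place of the simple tokenizer used here.

```python
import re

# Hypothetical stop-word list; the patent loads one from a file it does not describe.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is"}

def split_sentences(block_text):
    """Split one text block into sentences at full stops (ASCII or full-width).

    The end of the block acts as the paragraph end mark, so trailing text
    without a full stop is still returned as a sentence.
    """
    parts = re.findall(r"[^。.]+[。.]?", block_text)
    return [p.strip() for p in parts if p.strip()]

def to_word_sequence(sentence):
    """Tokenize a sentence and drop stop words, giving the word sequence."""
    # Stand-in tokenizer: lower-cased runs of word characters.
    tokens = re.findall(r"\w+", sentence.lower())
    return [t for t in tokens if t not in STOP_WORDS]

if __name__ == "__main__":
    for sentence in split_sentences("Rain fell in the city. Roads closed"):
        print(sentence, "->", to_word_sequence(sentence))
```

Splitting on every full stop also breaks decimal numbers such as "3.5"; a production splitter would need extra rules, which the source does not discuss.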
Preferably, step (2) comprises:
(2.1) calculating the co-occurrence similarity $sim_{i,j}$ of $word_i$ and $word_j$;
(2.2) iterating according to the formula $pr_i = \frac{1-d}{m} + d \sum_{j} \frac{sim_{j,i} \cdot pr_j}{out_j}$;
(2.3) sorting the sentences $s_i$ in descending order of $pr_i$ to generate the sentence sequence $s_k$;
where $word_i$ is the word sequence corresponding to sentence text $s_i$, $word_j$ is the word sequence corresponding to sentence text $s_j$, $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$, $d \in (0,1)$, $m$ is the maximum dimension of the matrix, $out_j$ is the out-degree of sentence vertex $j$, the initial value of $pr$ is $1/m$, and the convergence precision is 0.001.
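The iteration of step (2.2) can be sketched in Python as follows. This is an illustration under stated assumptions, not the applicant's implementation: the constant term is read as (1 - d)/m, as in normalized PageRank; the out-degree of a sentence is taken to be the number of other sentences with non-zero similarity to it; and the default damping value 0.85, the iteration cap and the function name `rank_sentences` are choices made here, since the source only requires d to lie in (0, 1).

```python
def rank_sentences(sim, d=0.85, tol=0.001, max_iter=100):
    """Iterate pr_i = (1 - d)/m + d * sum_j sim[j][i] * pr[j] / out[j].

    sim is an m x m similarity matrix. Returns the scores and the sentence
    indices sorted by descending score.
    """
    m = len(sim)
    # Assumed definition of out-degree: count of non-zero edges leaving j.
    out = [sum(1 for i in range(m) if i != j and sim[j][i] > 0) for j in range(m)]
    pr = [1.0 / m] * m  # initial value 1/m, as stated in the source
    for _ in range(max_iter):
        new_pr = []
        for i in range(m):
            incoming = sum(sim[j][i] * pr[j] / out[j]
                           for j in range(m) if j != i and out[j] > 0)
            new_pr.append((1 - d) / m + d * incoming)
        # Stop once no score moved by more than the convergence precision.
        converged = max(abs(a - b) for a, b in zip(new_pr, pr)) < tol
        pr = new_pr
        if converged:
            break
    order = sorted(range(m), key=lambda i: pr[i], reverse=True)
    return pr, order
```

In a three-sentence example where sentence 0 shares words with both of the others and they share none with each other, sentence 0 receives the highest score and comes first in the returned order.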
Preferably, step (3) comprises:
(3.1) taking the first $L$ sentences from $s_k$;
(3.2) expanding each of the $L$ selected sentences with its preceding and following context to obtain the set $s_l$;
(3.3) re-sorting $s_l$ by order in the original text to obtain $s'_l$;
(3.4) inserting $s'_l$ at the corresponding positions in combination with $h_i$;
(3.5) if several consecutive sentences are all unselected, i.e. not in the set $s'_l$, merging them;
(3.6) according to the length or percentage set by the user, judging whether the length of the result of (3.5) meets the setting and, if it is exceeded, truncating the text to obtain the final result.
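Steps (3.1) to (3.3) can be sketched as below. The source does not say how far the context expansion reaches, so the `window` parameter (one neighbouring sentence on each side by default) is an assumption, as is the function name `select_with_context`.

```python
def select_with_context(ranked_order, top_l, m, window=1):
    """Pick the top-L ranked sentences, add neighbours, restore original order.

    ranked_order: sentence indices sorted by descending score.
    m: total number of sentences in the article.
    window: assumed number of neighbouring sentences added on each side.
    """
    selected = set()
    for idx in ranked_order[:top_l]:
        for k in range(idx - window, idx + window + 1):
            if 0 <= k < m:  # stay inside the article
                selected.add(k)
    # Sorting the indices puts the sentences back in original-text order.
    return sorted(selected)
```

With six sentences ranked as [4, 0, 2, 1, 3, 5] and L = 2, sentences 4 and 0 are chosen; expansion adds 3, 5 and 1, and the sorted result is [0, 1, 3, 4, 5].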
Compared with the prior art, the beneficial effects of the present invention are:
Compared with ordinary automatic summarization, it adds HTML formatting and retains pictures and tables, which optimizes the presentation of the digest and enhances the user's visual experience.
Traditional automatic summarization loses semantics; the present invention expands the context of sentences, merges unselected sentences and joins the remaining text with ellipses, which compensates for the semantic loss of traditional summaries and improves semantic integrity and continuity.
The present invention provides two options, the percentage of the original text that the summary occupies and the length of the summary, for the user to set, which improves flexibility.
100 articles were drawn at random and, on manual inspection, the pass rate reached 99.8%.
Brief description of the drawings
Fig. 1 is a flow chart of the automatic summarization method for news-optimized reading mobile applications provided by the invention.
Fig. 2 is a structural diagram of the preprocessing module in the automatic summarization method for news-optimized reading mobile applications provided by the invention.
Fig. 3 is a flow chart of the text summary extraction module in the automatic summarization method for news-optimized reading mobile applications provided by the invention.
Fig. 4 is a flow chart of the result generation module in the automatic summarization method for news-optimized reading mobile applications provided by the invention.
Detailed description
The specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The automatic summarization method for news-optimized reading mobile applications of the present invention comprises the following steps: preprocessing the news web page content, extracting the text summary, and generating the result.
Fig. 2 shows the structure of the preprocessing of news web page content. Preprocessing first divides the news web page content into blocks, so that each news article corresponds to a sequence of blocks and each block corresponds to a word sequence. The specific steps are as follows:
1. Load the dictionary and the stop-word list;
2. Divide the news web page content into blocks according to HTML tags, denoted $k_i$ ($i \in \{1, 2, 3, \ldots, n\}$); if there is a table, extract the table as an independent block $k_j$; otherwise each span from a start tag to its end tag is taken as one block $k_j$;
3. Split each $k_i$ ($i \neq j$) into sentences, using paragraph end marks and full stops as sentence boundaries;
4. Extract the HTML tag $h_i$ ($i \in \{1, 2, 3, \ldots, m\}$) and the text $s_i$ ($i \in \{1, 2, 3, \ldots, m\}$) of each sentence;
5. Record the corresponding positions of the tag $h_i$ and the text $s_i$ ($i \in \{1, 2, 3, \ldots, m\}$) of each sentence;
6. Segment the text $s_i$ ($i \in \{1, 2, 3, \ldots, m\}$) into words;
7. Remove stop words and other noise; the result is denoted $word_i$ ($i \in \{1, 2, 3, \ldots, m\}$), where each $word_i$ is the word sequence remaining after stop-word removal and denoising.
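The block division of step 2, with tables kept as independent blocks, might look like the following sketch built on Python's standard `html.parser` module. The set of block-level tags, the class name `BlockSplitter` and the helper `split_blocks` are hypothetical; the source states only that blocks run from a start tag to an end tag and that tables are extracted whole.

```python
from html.parser import HTMLParser

# Hypothetical set of tags treated as block boundaries.
BLOCK_TAGS = {"p", "h1", "h2", "h3", "div", "li"}

class BlockSplitter(HTMLParser):
    """Collect (tag, text) blocks; a <table> is kept whole as one block."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished (tag, text) pairs in document order
        self._table_depth = 0  # > 0 while inside a table (handles nesting)
        self._buf = []         # text fragments of the block being read
        self._tag = None       # tag of the block being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            if self._table_depth == 0:
                self._flush()
                self._tag = "table"
            self._table_depth += 1
        elif self._table_depth == 0 and tag in BLOCK_TAGS:
            self._flush()
            self._tag = tag

    def handle_endtag(self, tag):
        if tag == "table" and self._table_depth > 0:
            self._table_depth -= 1
            if self._table_depth == 0:
                self._flush()
        elif self._table_depth == 0 and tag == self._tag:
            self._flush()

    def handle_data(self, data):
        if self._tag is not None:
            self._buf.append(data)

    def _flush(self):
        # Join fragments and normalize whitespace before storing the block.
        text = " ".join(" ".join(self._buf).split())
        if self._tag is not None and text:
            self.blocks.append((self._tag, text))
        self._buf, self._tag = [], None

def split_blocks(html):
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

Keeping the tag alongside each block's text is what later allows the selected sentences to be written back with their original markup; table blocks would bypass sentence splitting and be copied into the result unchanged.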
Fig. 3 is the flow chart of the text summary extraction module. The specific steps are as follows:
1. Calculate the co-occurrence similarity $sim_{i,j}$ of $word_i$ and $word_j$:
Calculate the similarity $sim_{i,j}$ of $word_i$ (the word sequence corresponding to sentence text $s_i$) and $word_j$ (the word sequence corresponding to sentence text $s_j$), where $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$;
Generate the undirected graph matrix from $sim_{i,j}$;
2. Iterate according to the formula $pr_i = \frac{1-d}{m} + d \sum_{j} \frac{sim_{j,i} \cdot pr_j}{out_j}$, where $d \in (0,1)$, $m$ is the maximum dimension of the matrix, $out_j$ is the out-degree of sentence vertex $j$ (i.e. sentence $j$), and the convergence precision is 0.001;
3. Sort the sentences $s_i$ in descending order of $pr_i$ to generate the sentence sequence $s_k$;
where the symbols are as defined in steps 1 and 2 and the initial value of $pr$ is $1/m$.
Note: the formula in (2.2) comes from the PageRank algorithm (Brin and Page, 1998).
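The source does not give a formula for the co-occurrence similarity, so the sketch below substitutes a Jaccard overlap of the two word sets purely as an illustrative assumption; any symmetric co-occurrence measure could be dropped in. The function names are likewise hypothetical. Because the measure is symmetric, the resulting matrix describes an undirected graph, as step 1 requires.

```python
def cooccurrence_similarity(words_i, words_j):
    """Assumed measure: shared distinct words divided by all distinct words."""
    a, b = set(words_i), set(words_j)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def similarity_matrix(word_seqs):
    """Build the symmetric m x m matrix of pairwise sentence similarities."""
    m = len(word_seqs)
    sim = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            # Fill both halves so that sim[i][j] == sim[j][i].
            sim[i][j] = sim[j][i] = cooccurrence_similarity(word_seqs[i], word_seqs[j])
    return sim
```

The diagonal is left at zero so that a sentence does not contribute to its own score during the iteration.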
Fig. 4 is the flow chart of the result generation module. The specific steps of the result generation module are as follows:
1. Take the first $L$ sentences from $s_k$, where $L \in (1, m)$;
2. Expand each of the $L$ selected sentences with its preceding and following context to obtain the set $s_l$;
3. Re-sort $s_l$ by order in the original text to obtain $s'_l$;
4. In combination with $h_i$ ($i \in \{1, 2, 3, \ldots, m\}$) and the position information, insert $s'_l$ at the corresponding positions;
5. If several consecutive sentences are all unselected, i.e. not in the set $s'_l$, merge them and join the surrounding text with '...';
6. According to the length or percentage set by the user, judge whether the length of the result of step 5 meets the setting; if it is exceeded, truncate the text to obtain the final result.
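The merging of unselected runs and the final length check can be sketched as follows. The sketch works on plain sentence strings and leaves out the re-insertion of HTML tags; measuring length in characters, applying the stricter of the two limits when both are given, and the function name `assemble_summary` are assumptions, since the source only says the user sets either a length or a percentage.

```python
def assemble_summary(sentences, kept, max_chars=None, percent=None):
    """Join kept sentences, replacing each run of unselected ones with '...'.

    sentences: all sentences in original order.
    kept: indices of sentences retained after selection and expansion.
    max_chars / percent: optional user limits (absolute length, or share of
    the original text length in percent).
    """
    kept = set(kept)
    pieces = []
    gap_open = False  # True while inside a run of unselected sentences
    for i, sentence in enumerate(sentences):
        if i in kept:
            pieces.append(sentence)
            gap_open = False
        elif not gap_open:
            pieces.append("...")  # one ellipsis per consecutive run
            gap_open = True
    summary = " ".join(pieces)

    limit = max_chars
    if percent is not None:
        total = sum(len(s) for s in sentences)
        pct_limit = int(total * percent / 100)
        limit = pct_limit if limit is None else min(limit, pct_limit)
    if limit is not None and len(summary) > limit:
        summary = summary[:limit]  # truncate when the limit is exceeded
    return summary
```

Truncating by character count can cut a word in half; cutting at a word or sentence boundary would read better, but the source does not specify where the cut is made.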
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention and not to limit it. Those of ordinary skill in the art may still modify the specific embodiments of the invention or make equivalent replacements with reference to the above embodiments; any such modification or equivalent replacement that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the present invention.

Claims (5)

1. An automatic summarization method for news-optimized reading mobile applications, characterized in that the method comprises:
(1) preprocessing the news web page content;
(2) extracting the text summary;
(3) generating the result.
2. The automatic summarization method for news-optimized reading mobile applications according to claim 1, characterized in that step (1) comprises:
(1.1) loading a dictionary and a stop-word list;
(1.2) dividing the news web page content into blocks according to HTML tags, denoted $k_i$;
(1.3) splitting each $k_i$ into sentences, using paragraph end marks and full stops as sentence boundaries;
(1.4) extracting the HTML tag $h_i$ and the text $s_i$ of each sentence;
(1.5) recording the corresponding positions of the tag $h_i$ and the text $s_i$ of each sentence;
(1.6) segmenting the text $s_i$ into words;
(1.7) removing stop words and other noise, the result being denoted $word_i$.
3. The automatic summarization method for news-optimized reading mobile applications according to claim 2, characterized in that each $word_i$ is the word sequence remaining after the stop words are removed.
4. The automatic summarization method for news-optimized reading mobile applications according to claim 1, characterized in that step (2) comprises:
(2.1) calculating the co-occurrence similarity $sim_{i,j}$ of $word_i$ and $word_j$;
(2.2) iterating according to the formula $pr_i = \frac{1-d}{m} + d \sum_{j} \frac{sim_{j,i} \cdot pr_j}{out_j}$;
(2.3) sorting the sentences $s_i$ in descending order of $pr_i$ to generate the sentence sequence $s_k$;
where $word_i$ is the word sequence corresponding to sentence text $s_i$, $word_j$ is the word sequence corresponding to sentence text $s_j$, $sim_{i,j}$ is the contribution of sentence $i$ to sentence $j$, $d \in (0,1)$, $m$ is the maximum dimension of the matrix, $out_j$ is the out-degree of sentence vertex $j$, the initial value of $pr$ is $1/m$, and the convergence precision is 0.001.
5. The automatic summarization method for news-optimized reading mobile applications according to claim 1, characterized in that step (3) comprises:
(3.1) taking the first $L$ sentences from $s_k$;
(3.2) expanding each of the $L$ selected sentences with its preceding and following context to obtain the set $s_l$;
(3.3) re-sorting $s_l$ by order in the original text to obtain $s'_l$;
(3.4) inserting $s'_l$ at the corresponding positions in combination with $h_i$;
(3.5) if several consecutive sentences are all unselected, i.e. not in the set $s'_l$, merging them;
(3.6) according to the length or percentage set by the user, judging whether the length of the result of (3.5) meets the setting and, if it is exceeded, truncating the text to obtain the final result.
CN201510063837.5A 2015-02-06 2015-02-06 News optimized reading mobile application-oriented automatic summarization method Pending CN104657347A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510063837.5A CN104657347A (en) 2015-02-06 2015-02-06 News optimized reading mobile application-oriented automatic summarization method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510063837.5A CN104657347A (en) 2015-02-06 2015-02-06 News optimized reading mobile application-oriented automatic summarization method

Publications (1)

Publication Number Publication Date
CN104657347A true CN104657347A (en) 2015-05-27

Family

ID=53248496

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510063837.5A Pending CN104657347A (en) 2015-02-06 2015-02-06 News optimized reading mobile application-oriented automatic summarization method

Country Status (1)

Country Link
CN (1) CN104657347A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335529A (en) * 2015-12-10 2016-02-17 天津海量信息技术有限公司 Consistent multi-type data preprocessing method
WO2017147785A1 (en) * 2016-03-01 2017-09-08 Microsoft Technology Licensing, Llc Automated commentary for online content
CN110008313A (en) * 2019-04-11 2019-07-12 重庆华龙网海数科技有限公司 A kind of unsupervised text snippet method of extraction-type
CN110110238A (en) * 2019-03-14 2019-08-09 厦门天锐科技股份有限公司 A kind of sensitive information methods of exhibiting and device
CN110162765A (en) * 2018-02-11 2019-08-23 鼎复数据科技(北京)有限公司 A kind of machine aid reading auditing method and system based on abstract mode
US11514242B2 (en) 2019-08-10 2022-11-29 Chongqing Sizai Information Technology Co., Ltd. Method for automatically summarizing internet web page and text information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020194230A1 (en) * 2001-06-19 2002-12-19 Fuji Xerox Co., Ltd. System and method for generating analytic summaries
CN1536483A (en) * 2003-04-04 2004-10-13 陈文中 Method for extracting and processing network information and its system
CN101233510A (en) * 2005-07-26 2008-07-30 泰普有限公司 Processing and sending search results over a wireless network to a mobile device
CN102298638A (en) * 2011-08-31 2011-12-28 北京中搜网络技术股份有限公司 Method and system for extracting news webpage contents by clustering webpage labels
WO2014035334A1 (en) * 2012-08-30 2014-03-06 Nuffnangx Pte Ltd Semiotic selection method and system for text summarization

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张东晋: "基于单事件新闻多文档聚类及自动文摘的设计与实现", 《中国优秀硕士论文全文数据 信息科技辑》 *
黄文蓓: "基于网页分割和摘要的小屏幕设备网页自适应技术研究与实现", 《中国优秀硕士论文全文数据库 信息科技辑》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105335529A (en) * 2015-12-10 2016-02-17 天津海量信息技术有限公司 Consistent multi-type data preprocessing method
WO2017147785A1 (en) * 2016-03-01 2017-09-08 Microsoft Technology Licensing, Llc Automated commentary for online content
US11922300B2 (en) 2016-03-01 2024-03-05 Microsoft Technology Licensing, Llc. Automated commentary for online content
CN110162765A (en) * 2018-02-11 2019-08-23 鼎复数据科技(北京)有限公司 A kind of machine aid reading auditing method and system based on abstract mode
CN110110238A (en) * 2019-03-14 2019-08-09 厦门天锐科技股份有限公司 A kind of sensitive information methods of exhibiting and device
CN110008313A (en) * 2019-04-11 2019-07-12 重庆华龙网海数科技有限公司 A kind of unsupervised text snippet method of extraction-type
US11514242B2 (en) 2019-08-10 2022-11-29 Chongqing Sizai Information Technology Co., Ltd. Method for automatically summarizing internet web page and text information

Similar Documents

Publication Publication Date Title
CN104657347A (en) News optimized reading mobile application-oriented automatic summarization method
US10318614B2 (en) Transformation of marked-up content into a file format that enables automated browser based pagination
CN102184189B (en) Webpage core block determining method based on DOM (Document Object Model) node text density
CN102253979B (en) Vision-based web page extracting method
Sun et al. Dom based content extraction via text density
CN102541874B (en) Webpage text content extracting method and device
CN104598577B (en) A kind of extracting method of Web page text
US20130185633A1 (en) Low resolution placeholder content for document navigation
CN102591612B (en) General webpage text extraction method based on punctuation continuity and system thereof
CN102915361B (en) Webpage text extracting method based on character distribution characteristic
CN103577171B (en) A kind of method and mobile terminal of display web page contents
CN101727461A (en) Method for extracting content of web page
CN102262625A (en) Method and device for extracting keywords of page
CN104317786A (en) Method and system for segmenting text paragraphs
CN109710947A (en) Power specialty word stock generating method and device
CN104063380A (en) Method and device for converting picture files into webpage files
CN105320734A (en) Web page core content extraction method
CN110781291A (en) Text abstract extraction method, device, server and readable storage medium
CN109213480A (en) A kind of method, storage medium, equipment and system for developing the back-stage management page
Liu et al. Main content extraction from web pages based on node characteristics
CN101986289B (en) Method and device for increasing browser page rendering speed
CN106126496A (en) A kind of information segmenting method and device
WO2018179002A1 (en) Transformation of marked-up content into a file format that enables automated browser based pagination
CN104536947A (en) Layout document processing method and device
CN104636431A (en) Automatic extraction and optimizing method for document abstracts of different fields

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20160412

Address after: 100086, No. 2, building 43, No. 5 West Third Ring Road, Haidian District, Beijing, 01-03A

Applicant after: Beijing Wyatt Network Technology Co. Ltd.

Address before: Shou Heng Technology Building No. 51 Beijing 100191 Haidian District Xueyuan Road room 0902

Applicant before: Beijing Zhongsou Network Technology Co,Ltd

RJ01 Rejection of invention patent application after publication
RJ01 Rejection of invention patent application after publication

Application publication date: 20150527