CN104462394A - System and method for recognizing content posts of webpage - Google Patents

Info

Publication number
CN104462394A
Authority
CN
China
Prior art keywords
text
node
webpage
module
dom tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410758368.4A
Other languages
Chinese (zh)
Other versions
CN104462394B (en)
Inventor
陈营营
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Hongxiang Technical Service Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd
Priority to CN201410758368.4A (patent CN104462394B)
Priority claimed from CN201210214079.9A (external priority, patent CN102779170B)
Publication of CN104462394A
Application granted
Publication of CN104462394B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a system and a method for recognizing content posts of webpages. The system comprises a webpage analysis and layout module, a node recognition module, a post dividing module and a mobile terminal page generation module. The webpage analysis and layout module is suitable for parsing the source code of a webpage and performing layout calculation on the parsing result to generate a document object model (DOM) tree of the webpage; the node recognition module is suitable for identifying content nodes and spam-word nodes by traversing from the root node of the DOM tree; the post dividing module is suitable for dividing the recognized content nodes according to the posts of the webpage; and the mobile terminal page generation module is suitable for generating a mobile terminal page. After recognizing and extracting the content of a conventional internet webpage, the system and method can extract bulletin board system (BBS) content, news content and comments, and restore the "post-by-post" display of the content in the original webpage. The result keeps the original "multi-post" structure and gives users an excellent reading experience.

Description

A system and method for recognizing content posts (floors) of a webpage
The present application is a divisional application of the Chinese invention patent application filed on June 25, 2012, with application number 201210214079.9, entitled "A system and method for recognizing content posts (floors) of a webpage".
Technical field
The present invention relates to the field of the internet, and in particular to a method for recognizing content posts of a webpage.
Background art
With the development and spread of mobile terminals, people increasingly use them to browse webpages. However, most websites on the internet do nothing special for display on mobile terminals, so most webpages are distorted when shown on a mobile terminal, and the user's reading experience is very poor.
The current way of improving the reading experience is to extract the main content of a webpage, re-typeset it, and present it to the user again. This works reasonably well for news and information pages with large blocks of content, but user comments are discarded. For forums and similar pages whose content is split into many "floors" (the individual posts of a thread), the result is worse: only the content of one floor is recognized, or no content is recognized at all. In addition, spam-word information in the source page is not removed, and because webpage content has no fixed format, the generated page may differ in appearance from the source page.
Summary of the invention
The object of the present invention is to remove the dependence of current content extraction techniques on the largest text block and to improve their poor handling of multi-"floor" content, so that when the content of a webpage is extracted and re-typeset, not only the main content can be recognized and extracted, but also the comments on news articles and the multi-"floor" content of forums.
A system for recognizing webpage content, the system comprising:
a webpage analysis and layout module, suitable for parsing the source code of a webpage and performing layout calculation on the parsing result to generate the DOM tree of the webpage;
a node recognition module, suitable for traversing from the root node of the DOM tree and identifying the content nodes and/or spam-word nodes in the DOM tree;
a post dividing module, suitable for dividing the identified content nodes according to the posts (floors) of the webpage.
Wherein the nodes of the DOM tree are divided according to the tags of the webpage markup language.
Wherein the system comprises a mobile terminal page generation module, suitable for generating a mobile terminal page.
Wherein the root node is the body node.
Wherein, after generation, the DOM tree of the webpage retains only the main elements of the webpage.
Wherein the main elements of the webpage comprise text, image links and/or text formatting.
Wherein the node recognition module comprises:
a statistical module, suitable for calculating the node distribution value, text density and/or spam-word density of each webpage;
an analysis module, suitable for analyzing the node distribution value to determine the node composition of each webpage, and for comparing the text density and/or spam-word density with the corresponding preset thresholds;
a content recognition module, suitable for recognizing as content the parts whose text density and/or spam-word density fall within the corresponding thresholds.
Wherein the node distribution value represents the composition of the child nodes of a node, including the number of each kind of tag and the proportion of the child nodes that each tag accounts for;
the text density represents the average text length obtained by dividing the text length in a node by the number of its child nodes;
the spam-word density represents the total length of all spam words in a node divided by the length of all text in the node.
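The two density definitions above can be restated compactly as follows. The symbols are introduced here only for illustration and do not appear in the original text; the case of a node with no child nodes or no text is not addressed by the source.

```latex
\[
\rho_{\mathrm{text}}(n) = \frac{L_{\mathrm{text}}(n)}{C(n)},
\qquad
\rho_{\mathrm{spam}}(n) = \frac{L_{\mathrm{spam}}(n)}{L_{\mathrm{text}}(n)}
\]
% L_text(n): total length of the text contained in node n
% C(n):      number of child nodes of n
% L_spam(n): total length of all dictionary spam words found in n
```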
Wherein the spam words are identified on the basis of a dictionary.
Wherein the post dividing module comprises:
a position dividing module, suitable for dividing posts according to the positional relationship of the content nodes in the DOM tree; and/or
a feature word dividing module, suitable for dividing posts according to feature words in the webpage.
Wherein the position dividing module divides according to the following rules:
if two content nodes are adjacent in the DOM tree, the two nodes belong to the same post;
if a content node has the same parent node as other content nodes already determined to belong to one post, these content nodes all belong to that post;
if the common parent node of two content nodes is the root node, the two content nodes are assigned to different posts;
and, if the relationship between content nodes is not covered by the cases above, they are assigned to different posts.
Wherein the feature words comprise author information in content nodes and/or the posting time, registration time or news comments in non-content nodes.
Wherein the spam-word nodes, once identified, serve as a basis for dividing the content nodes into posts.
Wherein the mobile terminal page generation module comprises:
a layout generation module, suitable for re-typesetting the content of the content nodes according to the divided posts to generate the mobile terminal page.
A method for recognizing webpage content, the method comprising:
parsing the source code of a webpage and performing layout calculation on the parsing result to generate the DOM tree of the webpage;
traversing from the root node of the DOM tree and identifying the content nodes and/or spam-word nodes in the DOM tree;
dividing the identified content nodes according to the posts (floors) of the webpage.
Wherein the nodes of the DOM tree are divided according to the tags of the webpage markup language.
Wherein, after the identified content nodes are divided according to the posts of the webpage, a mobile terminal page is generated.
Wherein the root node is the body node.
Wherein, after generation, the DOM tree of the webpage retains only the main elements of the webpage.
Wherein the main elements of the webpage comprise text, image links and/or text formatting.
Wherein the process of identifying the content nodes and/or spam-word nodes in the DOM tree comprises:
calculating the node distribution value, text density and/or spam-word density of each webpage;
analyzing the node distribution value to determine the node composition of each webpage, and comparing the text density and/or spam-word density with the corresponding preset thresholds;
recognizing as content the parts whose text density and/or spam-word density fall within the corresponding thresholds.
Wherein the node distribution value represents the composition of the child nodes of a node, including the number of each kind of tag and the proportion of the child nodes that each tag accounts for;
the text density represents the average text length obtained by dividing the text length in a node by the number of its child nodes;
the spam-word density represents the total length of all spam words in a node divided by the length of all text in the node.
Wherein the spam words are identified on the basis of a dictionary.
Wherein the identified content nodes are divided according to the posts of the webpage as follows:
dividing posts according to the positional relationship of the content nodes in the DOM tree; and/or
dividing posts according to feature words in the webpage.
Wherein dividing posts according to the positional relationship of the content nodes in the DOM tree follows these rules:
if two content nodes are adjacent in the DOM tree, the two nodes belong to the same post;
if a content node has the same parent node as other content nodes already determined to belong to one post, these content nodes all belong to that post;
if the common parent node of two content nodes is the root node, the two content nodes are assigned to different posts;
and, if the relationship between content nodes is not covered by the cases above, they are assigned to different posts.
Wherein the feature words comprise author information in content nodes and/or the posting time, registration time or news comments in non-content nodes.
Wherein the spam-word nodes, once identified, serve as a basis for dividing the content nodes into posts.
Wherein the process of generating the mobile terminal page is:
re-typesetting the content of the content nodes according to the divided posts to generate the mobile terminal page.
After recognizing and extracting the content of a conventional internet webpage, the present invention can effectively extract BBS content, news content and comments, and restore the "post-by-post" display of the content in the original webpage. The result keeps the original "multi-post" structure and gives users an excellent reading experience.
Description of the drawings
Fig. 1 is a structural diagram of the system of the present invention.
Fig. 2 is a flowchart of the method of the present invention.
Fig. 3 is a DOM tree generated according to the present invention.
Fig. 4 is a schematic diagram of the mobile terminal webpage generated from the DOM tree of Fig. 3.
Detailed description
Specific embodiments of the present invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the present invention and are not intended to limit its scope.
The structure of the system provided by the present invention is shown in Fig. 1.
The webpage analysis and layout module 100 parses the webpage source code and performs layout calculation. An HTML parsing engine is used to parse the HTML source code and compute the layout, for example a common open-source engine such as WebKit. Parsing and layout follow the tags in the webpage source code, including but not limited to div tags. The module generates the DOM tree of the webpage and calculates the position and height at which each node is displayed when the page is rendered. The generated DOM tree is shown in Fig. 3.
Because the dynamic effects of internet webpages are difficult to display on mobile terminals, they are discarded while the DOM tree is generated; only image links and the text formatting of the content are retained.
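The parsing step above can be illustrated with a minimal sketch. It uses Python's standard `html.parser` instead of a full engine such as WebKit, so it performs no layout calculation (no positions or heights); the `Node` class, the `build_dom` and `find_first` helpers, and the list of dropped tags are assumptions made for this illustration, not part of the patent.

```python
from html.parser import HTMLParser

# Tags treated as dynamic or non-displayable; this list is an assumption
# of the sketch, not something specified in the patent.
DROPPED_TAGS = {"script", "style", "noscript", "iframe"}
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """Simplified DOM node: an element, or a text node with tag '#text'."""

    def __init__(self, tag, parent=None, text="", attrs=None):
        self.tag = tag
        self.parent = parent
        self.text = text
        self.attrs = attrs or {}
        self.children = []
        if parent is not None:
            parent.children.append(self)


class SimpleDomBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("document")
        self.current = self.root
        self.skip_depth = 0  # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        node = Node(tag, self.current, attrs=dict(attrs))
        if tag not in VOID_TAGS:
            self.current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag in VOID_TAGS:
            return
        # Walk up to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return
        Node("#text", self.current, text=data.strip())


def build_dom(html):
    """Parse an HTML string into a simplified tree without dynamic content."""
    builder = SimpleDomBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def find_first(node, tag):
    """Depth-first search for the first element with the given tag."""
    if node.tag == tag:
        return node
    for child in node.children:
        found = find_first(child, tag)
        if found is not None:
            return found
    return None
```

Calling `find_first(build_dom(html), "body")` gives the body node from which the later traversal would start; script and style content never enters the tree.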
The node recognition module 200 traverses the whole DOM tree starting from the body node and identifies main content and spam-word content. It mainly uses an algorithm that can classify data by rules, typically a decision tree algorithm.
The node recognition module 200 comprises a statistical module, a comparison module and a content recognition module. First, the statistical module calculates the node distribution value, text density and spam-word density of each webpage. Then the comparison module compares the node distribution value, text density and spam-word density with preset thresholds. Finally, the content recognition module 300 recognizes as content the parts of the DOM tree whose node distribution value, text density and spam-word density fall within the thresholds.
Here, the node distribution represents the composition of the child nodes of a node, that is, the number of each kind of tag (such as div, img and table) and the proportion of the child nodes it accounts for. The text density is the average text length obtained by dividing the text length in a node by the number of its child nodes. The spam-word density (the density of non-content vocabulary) is the total length of all advertising words in a node divided by the length of all text in the node. Spam words are identified using a manually maintained dictionary of words and phrases that are unrelated to the content of the page, such as "Print Preview", "Bump", "Popular comments" and "No popular comments yet".
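The three statistics defined above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Node` dataclass, the function names and the two-entry `SPAM_WORDS` dictionary are hypothetical, and the handling of nodes with no children or no text is an assumption because the source does not define those cases.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical dictionary; the description says it is maintained by hand.
SPAM_WORDS = ["Print Preview", "Popular comments"]


@dataclass
class Node:
    tag: str                 # element tag, or "#text" for a text node
    text: str = ""           # only used by text nodes
    children: list = field(default_factory=list)


def all_text(node):
    """Concatenate the text of every text node at or below this node."""
    if node.tag == "#text":
        return node.text
    return "".join(all_text(child) for child in node.children)


def node_distribution(node):
    """Map each child tag to (count, share of the node's children)."""
    counts = Counter(child.tag for child in node.children)
    total = len(node.children)
    return {tag: (n, n / total) for tag, n in counts.items()}


def text_density(node):
    """Text length in the node divided by its number of child nodes."""
    if not node.children:
        return 0.0  # assumption: the childless case is undefined in the source
    return len(all_text(node)) / len(node.children)


def spam_word_density(node, spam_words=SPAM_WORDS):
    """Total length of dictionary spam words divided by total text length."""
    text = all_text(node)
    if not text:
        return 0.0  # assumption: an empty node is treated as spam-free
    spam_length = sum(text.count(word) * len(word) for word in spam_words)
    return spam_length / len(text)
```

The substring count is deliberately naive: it is case-sensitive and would double-count dictionary entries that overlap, so a real dictionary lookup would need proper tokenization.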
Thresholds for the three features above are derived with the decision tree algorithm. Nodes that fall within the threshold range are all recognized as content, and the remaining nodes are recognized as spam words.
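A hand-written rule set can stand in for the learned decision tree to show how the thresholds would be applied. Everything specific here is an assumption: the threshold values are invented for illustration, and `link_ratio` (the share of `<a>` children taken from the node distribution) is just one possible way of reducing the node distribution to a single number.

```python
from dataclasses import dataclass


@dataclass
class NodeFeatures:
    text_density: float       # text length / number of child nodes
    spam_word_density: float  # spam-word length / total text length
    link_ratio: float         # share of <a> children (assumed feature)


# Hypothetical thresholds. In the described system they would be derived by
# a decision tree algorithm rather than fixed by hand.
MIN_TEXT_DENSITY = 20.0
MAX_SPAM_WORD_DENSITY = 0.3
MAX_LINK_RATIO = 0.5


def classify(features):
    """Return 'content' if every feature is within its threshold, else 'spam'."""
    if features.spam_word_density > MAX_SPAM_WORD_DENSITY:
        return "spam"
    if features.link_ratio > MAX_LINK_RATIO:
        return "spam"
    if features.text_density < MIN_TEXT_DENSITY:
        return "spam"
    return "content"
```

Each `if` corresponds to one split of a small decision tree; a trained tree would choose both the order of the splits and the cut-off values from labelled pages.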
The post dividing module comprises a position dividing module and a feature word dividing module.
The position dividing module divides and identifies posts (floors) according to the paths and positional relationships of the content nodes in the DOM tree, using the following rules:
1. If two content nodes are adjacent in the DOM tree, the two nodes belong to the same post.
As shown in Fig. 3, br denotes a line break, and the br tag is an empty tag. Content node 1 and content node 2 have the same parent node div1 and are adjacent, so content node 1 and content node 2 can be identified as nodes of the same post.
2. If a content node has the same parent node as other content nodes already determined to belong to one post, these content nodes all belong to that post.
In Fig. 3, content node 3 has the same parent node div1 as content node 2 and content node 1, and content node 2 and content node 1 have already been determined to belong to the same post, so content node 3 also belongs to that post.
3. If the common parent node of two content nodes is body, the two content nodes are assigned to different posts.
Take content node 1 and content node 4 in Fig. 3. Their paths in the DOM tree are, respectively:
Text 1 → div1 → body
Text 4 → div3 → body
The common parent node of the two paths is body, so the nodes should be identified as different posts.
4. If the relationship between content nodes is not covered by the cases above, they are assigned to different posts.
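One possible reading of the four rules above is that content nodes sharing the same direct parent form one post, unless that parent is the root (body) node, in which case each node is its own post. The sketch below implements that reading. The `Node` class and function names are hypothetical, and it assumes spam-word nodes have already been filtered out so that every text node in the tree is a content node.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    tag: str
    text: str = ""
    parent: "Node" = None
    children: list = field(default_factory=list)

    def add(self, child):
        """Attach a child node and return it."""
        child.parent = self
        self.children.append(child)
        return child


def text_nodes(node):
    """Yield text nodes in document order."""
    if node.tag == "#text":
        yield node
    for child in node.children:
        yield from text_nodes(child)


def divide_by_position(root):
    """Group content text nodes into posts by their position in the tree."""
    posts = []            # list of posts, each a list of text nodes
    post_of_parent = {}   # id(parent element) -> post it already owns
    for node in text_nodes(root):
        parent = node.parent
        if parent is not root and id(parent) in post_of_parent:
            # Rules 1 and 2: same parent as nodes already in a post.
            post_of_parent[id(parent)].append(node)
        else:
            # Rules 3 and 4: common parent is the root, or no rule applies.
            post = [node]
            posts.append(post)
            if parent is not root:
                post_of_parent[id(parent)] = post
    return posts
```

For a tree shaped like the one described for Fig. 3 (three texts under one div, one under a second, two under a third), this yields three posts containing texts 1 to 3, text 4, and texts 5 and 6.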
The feature word dividing module divides mainly according to feature words in the nodes. For example, in BBS threads and news comments, the content posted by an author is displayed together with information about that author, and the two alternate, typically as:
Author information → content → author information → content → author information → content
By recognizing keywords that indicate author information in non-content nodes (such as posting time and registration time), the content is further divided into "floors" (posts).
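The alternating pattern above suggests a simple pass over the nodes in document order. The following is a sketch under stated assumptions: the input representation (a list of `(is_content, text)` tuples), the function name and the two English keywords in `AUTHOR_INFO_PATTERN` are all hypothetical stand-ins for whatever feature words a real deployment would use.

```python
import re

# Hypothetical feature words that signal an author-information block.
AUTHOR_INFO_PATTERN = re.compile(r"Posted on|Registered")


def divide_by_feature_words(nodes):
    """Split content into posts at author-information markers.

    nodes: list of (is_content, text) tuples in document order.
    A non-content node matching the pattern opens a new post; the content
    nodes that follow it are assigned to that post.
    """
    posts = []
    current = None
    for is_content, text in nodes:
        if not is_content:
            if AUTHOR_INFO_PATTERN.search(text):
                if current is not None and not current["content"]:
                    # Several author-info lines in a row belong together.
                    current["author_info"].append(text)
                else:
                    current = {"author_info": [text], "content": []}
                    posts.append(current)
            continue  # other non-content nodes (e.g. spam words) are skipped
        if current is None:
            # Content that appears before any author information.
            current = {"author_info": [], "content": []}
            posts.append(current)
        current["content"].append(text)
    return posts
```

A non-content node that matches no feature word, such as a "Print Preview" link, neither opens a post nor contributes text to one.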
The mobile terminal page generation module comprises a layout generation module, which re-typesets the content of the content nodes according to the divided posts and generates the mobile terminal page. In the process above, for the DOM tree shown in Fig. 3, the resulting assignment of content nodes to posts is shown in Fig. 4, namely:
Post 1: text 1, text 2, text 3;
Post 2: text 4;
Post 3: text 5, text 6.
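The final re-typesetting step can be sketched as a function that turns the divided posts into a simple mobile page. The markup, class names and the `render_mobile_page` function are assumptions for illustration; the patent does not specify the output format.

```python
from html import escape


def render_mobile_page(posts, title="Untitled"):
    """Render posts (a list of lists of text strings) as a minimal HTML page."""
    parts = [
        "<!DOCTYPE html>",
        "<html><head>",
        '<meta charset="utf-8">',
        '<meta name="viewport" content="width=device-width, initial-scale=1">',
        f"<title>{escape(title)}</title>",
        "</head><body>",
    ]
    for number, texts in enumerate(posts, start=1):
        # One section per post keeps the "post-by-post" structure visible.
        parts.append(f'<section class="post" id="post-{number}">')
        parts.append(f"<h2>Post {number}</h2>")
        for text in texts:
            parts.append(f"<p>{escape(text)}</p>")
        parts.append("</section>")
    parts.append("</body></html>")
    return "\n".join(parts)
```

Passing the three posts listed above, for example `render_mobile_page([["text 1", "text 2", "text 3"], ["text 4"], ["text 5", "text 6"]])`, produces one `<section>` per post in the original order.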
The flowchart of the method provided by the present invention is shown in Fig. 2.
S102: parse the webpage source code and perform layout calculation. An HTML parsing engine is used to parse the HTML source code and compute the layout, for example a common open-source engine such as WebKit. Parsing and layout follow the tags in the webpage source code, mainly div tags. The DOM tree of the webpage is generated, and the position and height at which each node is displayed when the page is rendered are calculated. The generated DOM tree is shown in Fig. 3.
Because the dynamic effects of internet webpages are difficult to display on mobile terminals, they are discarded while the DOM tree is generated; only image links and the text formatting of the content are retained.
S104: traverse the whole DOM tree starting from the body node and identify main content and spam-word content. This step mainly uses an algorithm that can classify data by rules, typically a decision tree algorithm.
First, the node distribution value, text density and spam-word density of each webpage are calculated. Then the node distribution value, text density and spam-word density are compared with preset thresholds. Finally, the parts of the DOM tree that do not exceed the thresholds are recognized as content.
Here, the node distribution represents the composition of the child nodes of a node, that is, the number of each kind of tag (such as div, img and table) and the proportion of the child nodes it accounts for. The text density is the average text length obtained by dividing the text length in a node by the number of its child nodes. The spam-word density (the density of non-content vocabulary) is the total length of all advertising words in a node divided by the length of all text in the node. Spam words are identified using a manually maintained dictionary of words and phrases that are unrelated to the content of the page, such as "Print Preview", "Bump", "Popular comments" and "No popular comments yet".
Thresholds for the three features above are derived with the decision tree algorithm. Nodes that fall within the threshold range are all recognized as content, and the remaining nodes are recognized as spam words.
S106: divide the identified content nodes according to the posts (floors) of the webpage, using division by position and division by feature words.
Division by position divides and identifies posts according to the paths and positional relationships of the content nodes in the DOM tree, using the following rules:
1. If two content nodes are adjacent in the DOM tree, the two nodes belong to the same post.
As shown in Fig. 3, br denotes a line break, and the br tag is an empty tag. Content node 1 and content node 2 have the same parent node div1 and are adjacent, so content node 1 and content node 2 can be identified as nodes of the same post.
2. If a content node has the same parent node as other content nodes already determined to belong to one post, these content nodes all belong to that post.
In Fig. 3, content node 3 has the same parent node div1 as content node 2 and content node 1, and content node 2 and content node 1 have already been determined to belong to the same post, so content node 3 also belongs to that post.
3. If the common parent node of two content nodes is body, the two content nodes are assigned to different posts.
Take content node 1 and content node 4 in Fig. 3. Their paths in the DOM tree are, respectively:
Text 1 → div1 → body
Text 4 → div3 → body
The common parent node of the two paths is body, so the nodes should be identified as different posts.
4. If the relationship between content nodes is not covered by the cases above, they are assigned to different posts.
Division by feature words divides according to feature words in the nodes. For example, in BBS threads and news comments, the content posted by an author is displayed together with information about that author, and the two alternate, typically as:
Author information → content → author information → content → author information → content
By recognizing keywords that indicate author information in non-content nodes (such as posting time and registration time), the content is further divided into "floors" (posts).
To generate the mobile terminal page, the content of the content nodes is re-typeset according to the divided posts. In the process above, for the DOM tree shown in Fig. 3, the resulting assignment of content nodes to posts is shown in Fig. 4, namely:
Post 1: text 1, text 2, text 3;
Post 2: text 4;
Post 3: text 5, text 6.
It should be noted that the components of the controller of the present invention are divided logically according to the functions they implement. The present invention is not limited to this division: the components may be re-divided or combined as needed. For example, several components may be combined into a single component, or a component may be further decomposed into more sub-components.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the controller according to embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the embodiments above illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means can be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A system for recognizing webpage content, characterized in that the system comprises:
a webpage analysis and layout module, suitable for parsing the source code of a webpage and performing layout calculation on the parsing result to generate the DOM tree of the webpage;
a node recognition module, suitable for traversing from the root node of the DOM tree and identifying the content nodes and/or spam-word nodes in the DOM tree;
a post dividing module, suitable for dividing the identified content nodes according to the posts (floors) of the webpage.
2. The system as claimed in claim 1, characterized in that the nodes of the DOM tree are divided according to the tags of the webpage markup language.
3. The system as claimed in any one of claims 1-2, characterized in that the system comprises a mobile terminal page generation module, suitable for generating a mobile terminal page.
4. The system as claimed in any one of claims 1-3, characterized in that the root node is the body node.
5. The system as claimed in any one of claims 1-4, characterized in that, after generation, the DOM tree of the webpage retains only the main elements of the webpage.
6. The system as claimed in any one of claims 1-5, characterized in that the main elements of the webpage comprise text, image links and/or text formatting.
7. The system as claimed in any one of claims 1-6, characterized in that the node recognition module comprises:
a statistical module, suitable for calculating the node distribution value, text density and/or spam-word density of each webpage;
an analysis module, suitable for analyzing the node distribution value to determine the node composition of each webpage, and for comparing the text density and/or spam-word density with the corresponding preset thresholds;
a content recognition module, suitable for recognizing as content the parts whose text density and/or spam-word density fall within the corresponding thresholds.
8. The system as claimed in any one of claims 1-7, characterized in that:
the node distribution value represents the composition of the child nodes of a node, including the number of each kind of tag and the proportion of the child nodes that each tag accounts for;
the text density represents the average text length obtained by dividing the text length in a node by the number of its child nodes;
the spam-word density represents the total length of all spam words in a node divided by the length of all text in the node.
9. A method for recognizing webpage content, characterized in that the method comprises:
parsing the source code of a webpage and performing layout calculation on the parsing result to generate the DOM tree of the webpage;
traversing from the root node of the DOM tree and identifying the content nodes and/or spam-word nodes in the DOM tree;
dividing the identified content nodes according to the posts (floors) of the webpage.
10. The method as claimed in claim 9, characterized in that the nodes of the DOM tree are divided according to the tags of the webpage markup language.
CN201410758368.4A 2012-06-25 2012-06-25 A kind of system and method for identifying text floor of webpage Active CN104462394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410758368.4A CN104462394B (en) 2012-06-25 2012-06-25 A kind of system and method for identifying text floor of webpage

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210214079.9A CN102779170B (en) 2012-06-25 2012-06-25 System and method for identifying text floor of webpage
CN201410758368.4A CN104462394B (en) 2012-06-25 2012-06-25 A kind of system and method for identifying text floor of webpage

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201210214079.9A Division CN102779170B (en) 2012-06-25 2012-06-25 System and method for identifying text floor of webpage

Publications (2)

Publication Number Publication Date
CN104462394A (en) 2015-03-25
CN104462394B (en) 2018-05-11

Family

ID=52908429

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410758368.4A Active CN104462394B (en) 2012-06-25 2012-06-25 A kind of system and method for identifying text floor of webpage

Country Status (1)

Country Link
CN (1) CN104462394B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100083095A1 (en) * 2008-09-29 2010-04-01 Nikovski Daniel N Method for Extracting Data from Web Pages
CN102420842A (en) * 2010-09-28 2012-04-18 腾讯科技(深圳)有限公司 Method and system for sending webpage in mobile network
CN102184189A (en) * 2011-04-18 2011-09-14 北京理工大学 Webpage core block determining method based on DOM (Document Object Model) node text density

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
韩杰: "中文BBS信息提取与分类", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203527A (en) * 2016-03-16 2017-09-26 北大方正集团有限公司 The text extracting method and system of news web page
CN107203527B (en) * 2016-03-16 2019-06-28 北大方正集团有限公司 The text extracting method and system of news web page
CN108399167A (en) * 2017-02-04 2018-08-14 百度在线网络技术(北京)有限公司 Webpage information extracting method and device
CN108399167B (en) * 2017-02-04 2022-04-29 百度在线网络技术(北京)有限公司 Webpage information extraction method and device
CN110020247A (en) * 2017-12-22 2019-07-16 中移(苏州)软件技术有限公司 A kind of webpage key modules extracting method and device
CN110020247B (en) * 2017-12-22 2021-05-14 中移(苏州)软件技术有限公司 Webpage key module extraction method and device
CN111475146A (en) * 2019-01-24 2020-07-31 阿里健康信息技术有限公司 Method and device for identifying layout element attributes
CN111241446A (en) * 2020-01-13 2020-06-05 杭州安恒信息技术股份有限公司 Method, device, equipment and medium for extracting text content of web page
CN111241446B (en) * 2020-01-13 2023-10-31 杭州安恒信息技术股份有限公司 Method, device, equipment and medium for extracting text content of web page

Also Published As

Publication number Publication date
CN104462394B (en) 2018-05-11

Similar Documents

Publication Publication Date Title
CN102779170B (en) System and method for identifying text floor of webpage
CN110334346B (en) Information extraction method and device of PDF (Portable document Format) file
Arai et al. Method for real time text extraction of digital manga comic
CN102253979B (en) Vision-based web page extracting method
CN104598577B (en) A kind of extracting method of Web page text
CN103336766B (en) Short text garbage identification and modeling method and device
CN102298638A (en) Method and system for extracting news webpage contents by clustering webpage labels
CN104462394A (en) System and method for recognizing content posts of webpage
CN104504150A (en) News public opinion monitoring system
CN101593200A (en) Chinese Web page classification method based on the keyword frequency analysis
CN109033282B (en) Webpage text extraction method and device based on extraction template
WO2014153457A1 (en) Merging web page style addresses
CN107862039B (en) Webpage data acquisition method and system and data matching and pushing method
CN103123620A (en) Web text sentiment analysis method based on propositional logic
CN107145591B (en) Title-based webpage effective metadata content extraction method
CN112084451A (en) Webpage LOGO extraction system and method based on visual blocking
CN111209831A (en) Document table content identification method and device based on classification algorithm
CN103942332B (en) Web page logic link block identification method
CN106897287B (en) Webpage release time extraction method and device for webpage release time extraction
CN103118028B (en) Based on the security sweep method and system of web analysis
CN104598289A (en) Recognition method and electronic device
Kim et al. Main content extraction from web documents using text block context
Munot et al. Conceptual framework for abstractive text summarization
CN107291952B (en) Method and device for extracting meaningful strings
CN110704617B (en) News text classification method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220727

Address after: 300450 No. 9-3-401, No. 39, Gaoxin 6th Road, Binhai Science Park, Binhai New Area, Tianjin

Patentee after: 3600 Technology Group Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20230718

Address after: 1765, floor 17, floor 15, building 3, No. 10 Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: Beijing Hongxiang Technical Service Co.,Ltd.

Address before: 300450 No. 9-3-401, No. 39, Gaoxin 6th Road, Binhai Science Park, Binhai New Area, Tianjin

Patentee before: 3600 Technology Group Co.,Ltd.
