CN104765737A - Method for capturing HTML (HyperText markup Language) contents - Google Patents

Method for capturing HTML (HyperText markup Language) contents

Info

Publication number
CN104765737A
Authority
CN
China
Prior art keywords
label
target block
block group
content
coated
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410003176.2A
Other languages
Chinese (zh)
Inventor
蔡弘扬
洪启豪
谷鸿祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Esobi Inc
Original Assignee
Esobi Inc
Application filed by Esobi Inc
Priority to CN201410003176.2A
Publication of CN104765737A


Abstract

The invention discloses a method for capturing the contents of an HTML (HyperText Markup Language) file. The method comprises: recording the character positions of all target tags in the retained HTML file paragraph; taking the first target tag as a first starting tag and then setting several further starting tags, segmenting the paragraph into at least one target block group on the principle that no group may extend past the HTML file title and that the last target tag must be covered; comparing the target block groups in sequence against the target tags and deleting the unimportant target block groups; and outputting the contents of the remaining target block groups. In this way the important body text and other required information (e.g., pictures and hyperlinks associated with the important body text) can be accurately extracted from an HTML file.

Description

Method for capturing the content of a HyperText Markup Language file
Technical field
The present invention relates to a method for extracting web page content, and in particular to a method that can extract, from a HyperText Markup Language (HTML) file, the important body text together with other required information (such as pictures and hyperlinks related to the important body text).
Background technology
Existing techniques for converting an HTML file into plain-text content focus only on how to extract the important body text. Their aim is to avoid extracting unimportant junk text and to improve the accuracy of the plain-text result, but they often ignore information such as pictures or hyperlinks related to the important body text. As a result, readers often cannot see the related information or pictures, and may not even understand what the text is about.
The inventors previously disclosed, and were granted a patent on, a method for converting a HyperText Markup Language (HTML) file into a plain-text file. That method uses target tags such as <p> and <br> to capture the important text paragraphs, uses a preset sentence index value as the basis for segmentation to divide the retained HTML file paragraph into several target blocks, and then outputs as plain text the target block whose meaning is closest to the HTML file title. Although this improves the accuracy of extracting the important body text, in special cases any paragraph related to the file title is treated as important and captured. For example, if a reader comment that follows the important body text mentions the HTML file title, that comment is extracted as if it were part of the important body text, which is a content extraction error. That method also cannot extract information such as pictures or hyperlinks related to the important content, which is regrettable.
Another known approach parses the tree-shaped tag structure of the HTML. Although it can extract information such as pictures or hyperlinks related to the important body text, it must first build a tree structure for the entire HTML file and then take the content of one or several nodes as the important body text. This approach can only run in an environment capable of parsing the whole HTML file, so it is both restrictive and time-consuming. It also often errs when deciding which node holds the important body text, and if the important body text is split into several paragraphs that fall in different nodes, the important body text in the other nodes is easily missed.
Summary of the invention
To solve the above problems, the main object of the present invention is to provide a method for extracting required content from a HyperText Markup Language (HTML) file, which can accurately extract the important body text and other required information (such as pictures and hyperlinks related to the important body text) so that it is convenient for the user to read.
To achieve the above object, the method of the present invention first obtains an HTML file and executes a tag preprocessing procedure to capture an HTML file paragraph related to the main content. The HTML file paragraph contains at least one <p> tag or <br> tag and the content it encloses; the <p> and <br> tags are the target tags. The following steps are then carried out on this HTML file paragraph:
A. Search the HTML file paragraph for all target tags, and record the character position information of those target tags in a data structure;
B. Find the positions of the first target tag and the last target tag in the HTML file paragraph according to the information recorded in the data structure;
C. Set the first target tag as a first starting tag and carry out a paragraph segmentation step until the last target tag is covered, thereby separating out at least one target block group;
D. Set an association value, compare the target block group(s) in sequence against the target tags and record the association value, and delete the target block group(s) whose association value meets a set condition;
E. Output the content of the remaining target block group(s) as the required file.
The paragraph segmentation step in step C performs a continuous outward diffusion that covers successive target block groups, on the principle that no group may extend past the HTML file title and that the last target tag must be covered. Diffusion starts from the first target tag, taken as the first starting tag, and yields a first target block group. If that first target block group does not yet cover the last target tag and its range does not extend past the HTML file title, different starting tags are set in turn and diffusion is carried out from each, yielding several different target block groups, until the last target tag is covered by one of them.
After all target block groups in the HTML file paragraph have been separated out, step D sets an association value and compares the target block groups in sequence against the target tags (<p> or <br>). If a target block group does not contain a target tag, the association value is incremented by 1 and the next target block group is checked; if it does contain a target tag, the association value is reset to 0 and the next target block group is checked. The detailed steps of step D also set a value N, which serves as a distance threshold. When the association value reaches N at the target block group being checked, checking stops, all target block groups below the current one are deleted, and N target block groups are deleted upward starting from the current one (all deleted groups are regarded as unimportant). Only the content of the remaining target block groups is output, which achieves the object of accurately extracting the required content of the HTML file.
The above description of the invention and the following description of the embodiments are intended to demonstrate and explain the spirit and principle of the invention and to provide further explanation of the claims.
Brief description of the drawings
Fig. 1 is a flowchart of the steps of a preferred embodiment of the present invention.
Fig. 2 is a schematic diagram of the data structure of the preferred embodiment of the present invention.
Fig. 3 is a flowchart of part of the steps of the preferred embodiment of the present invention.
Fig. 4 is a flowchart of part of the steps of the preferred embodiment of the present invention.
Figs. 5A, 5B and 5C are schematic examples of the preferred embodiment of the present invention.
Fig. 6 is a schematic example of the preferred embodiment of the present invention.
Figs. 7A and 7B are schematic examples of the preferred embodiment of the present invention.
Reference numerals:
10 first target block group
101 first starting tag
102 second starting tag
103 third starting tag
20 first/second target block group
201 second starting tag
202 third starting tag
30 second target block group
41-43 target block groups
421 fourth starting tag
431 fourth starting tag
51-55 target block groups
Embodiments
The features and embodiments of the present invention are described in detail below with reference to the accompanying drawings and a preferred embodiment.
Fig. 1 shows a flowchart of the steps of a preferred embodiment of the present invention, comprising:
S1: Obtain an HTML file.
S2: Execute tag preprocessing to capture an HTML file paragraph related to the main content, the HTML file paragraph containing at least one target tag and the content it encloses.
S3: Search the HTML file paragraph for all target tags, and record the character position information of those target tags in a data structure.
S4: Find the positions of the first target tag and the last target tag in the HTML file paragraph according to the information recorded in the data structure.
S5: Set the first target tag as the first starting tag and carry out the paragraph segmentation step until the last target tag is covered, thereby separating out at least one target block group.
S6: Set an association value, compare all target block groups in sequence against the target tags and record the association value, and delete the target block groups whose association value meets a set condition.
S7: Output the content of the remaining target block groups as the required file.
The main purpose of steps S1 and S2 is to first delete the large amount of unwanted noise and unused tags in the HTML file (that is, the HTML source code); the different tags each have different functions. An HTML file is often very long while the important content appears in only a small part of it, so the invention first prunes the HTML file heavily, removing the tag ranges in which important content cannot appear and retaining the useful HTML file paragraph. This is a necessary preliminary step for capturing the important content of the HTML file.
The present invention works mainly on the retained HTML file paragraph to extract the useful content. Step S3 first searches this HTML file paragraph for all target tags, which are mainly <p> and <br> tags. In general, the truly important content of an HTML file tends to appear near <p> or <br> tags, so the invention first records the character position information of all target tags in the HTML file paragraph in a data structure, shown schematically in Fig. 2. The recorded information includes each <p> and <br> tag and the character position at which it appears in the HTML file paragraph. Step S4 can then find the character positions of the first and last target tags in the HTML file paragraph from the information recorded in the data structure.
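Steps S3 and S4 can be illustrated with a minimal Python sketch. It is not the patented implementation: the regular expression, the function names `record_target_tags` and `first_and_last`, and the sample HTML string are all assumptions made for illustration, and the "data structure" of Fig. 2 is modelled simply as a list of (tag name, character position) pairs.

```python
import re

# Assumption: a target tag is an opening <p ...> or a <br>/<br/> tag, matched case-insensitively.
TARGET_TAG = re.compile(r"<(p|br)\b[^>]*>", re.IGNORECASE)

def record_target_tags(paragraph):
    """Step S3: return (tag_name, character_position) for every target tag."""
    return [(m.group(1).lower(), m.start()) for m in TARGET_TAG.finditer(paragraph)]

def first_and_last(records):
    """Step S4: return the character positions of the first and last target tags."""
    if not records:
        raise ValueError("no target tag found in the HTML file paragraph")
    return records[0][1], records[-1][1]

# Hypothetical sample paragraph, not taken from the patent figures.
html = '<div><span><p>Main text</p></span></div><div>more<br>text</div>'
records = record_target_tags(html)
first_pos, last_pos = first_and_last(records)
```

The word boundary after the tag name keeps tags such as <param> or <pre> from being mistaken for <p>.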
Step S5 begins the paragraph segmentation step of the present invention. The first target tag that appears in the HTML file paragraph, that is, the first <p> or <br> tag, is set as the first starting tag, and segmentation starts from it. The segmentation step pushes the tags in the HTML file paragraph through a stack and diffuses outward from the chosen starting tag, so that the untidy HTML file paragraph is divided into at least one target block group; each target block group is then judged as to whether it belongs to the important content.
Refer to the flow of Fig. 3 together with the examples of Figs. 5A to 5C. Step S51 searches upward from the first starting tag for an unpaired opening tag and sets it as the front border of a first target block. In the example of Fig. 5A, the first target tag that appears in the HTML file paragraph is a <p> tag, so in this embodiment that <p> tag is set as the first starting tag 101. The search for an unpaired opening tag then proceeds upward from the first starting tag 101: each tag that appears before the <p> tag is pushed onto a stack, which works on the last-in, first-out principle. When a matching opening tag and closing tag are both on the stack, the pair is popped, whereas an unpaired opening or closing tag remains on the stack. An unpaired tag is therefore easy to find during diffusion, and it is set as the border of that round of diffusion. During this process, certain functional tags that have only an opening tag, such as <img>, <meta>, <input>, <embed>, <link>, <param>, <area>, <hr>, <col> and <xml>, are not pushed onto the stack for checking, which avoids capturing the wrong range for the target block group.
In the example of Fig. 5A, the first unpaired opening tag found by step S51 searching upward from the first starting tag <p> is the <span> opening tag, so this <span> tag is set as the front border of the first target block. Step S52 then searches downward from the first starting tag <p> for the first unpaired closing tag, finds the </span> closing tag, and sets it as the rear border of the first target block. In step S53, the content enclosed between the front border and the rear border of the first target block is merged into the first target block group 10.
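The stack-based border search of steps S51 to S53 might look roughly like the following Python sketch. The tokenizer, the function names and the sample HTML are hypothetical; adding "br" to the skipped tags is an assumption of this sketch (the description lists only the other ten), made because <br> has no closing tag.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")
# Tags the description keeps off the stack; "br" is added here as an assumption.
SKIPPED = {"img", "meta", "input", "embed", "link", "param", "area", "hr", "col", "xml", "br"}

def tokenize(html):
    """Return (name, is_closing, start, end) for each tag that takes part in pairing."""
    tags = []
    for m in TAG.finditer(html):
        name = m.group(2).lower()
        if name not in SKIPPED:
            tags.append((name, m.group(1) == "/", m.start(), m.end()))
    return tags

def unpaired_opening_before(tags, pos):
    """S51: scan upward from pos and return the first unpaired opening tag."""
    stack = []
    for name, is_closing, start, end in reversed(tags):
        if end > pos:
            continue  # tag is not entirely before the starting position
        if is_closing:
            stack.append(name)  # wait for its opening partner further up
        elif stack and stack[-1] == name:
            stack.pop()  # a complete pair leaves the stack
        else:
            return (name, start, end)
    return None

def unpaired_closing_from(tags, pos):
    """S52: scan downward from pos and return the first unpaired closing tag."""
    stack = []
    for name, is_closing, start, end in tags:
        if start < pos:
            continue
        if not is_closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            return (name, start, end)
    return None

# Hypothetical sample, loosely modelled on the <span>/<p> situation of Fig. 5A.
html = '<div><img src="a.png"><span><b>intro</b><p>Main text</p></span></div>'
tags = tokenize(html)
start_pos = html.index("<p>")
front = unpaired_opening_before(tags, start_pos)
rear = unpaired_closing_from(tags, start_pos)
first_block = html[front[1]:rear[2]]  # S53: content between the two borders
```

Calling the same two functions again with `front[1]` and `rear[2]` as the new positions would perform the next round of outward diffusion.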
Once the first target block group 10 has been found, it is checked for whether it covers the last target tag (step S54) and whether it covers the HTML file title (step S55). In step S54, if the content of the first target block group 10 already covers the last target tag, all the important content has been found and the process jumps to step S6 for the association value comparison (described under step S6 below). In this embodiment the first target block group 10 does not yet cover the last target tag, so step S55 checks whether it covers the HTML file title (the <h1>title</h1> element in the example of Fig. 5A). Here the content of the first target block group 10 does not cover the HTML file title, which means that the group can diffuse further outward, so the outward diffusion of step S56 follows. If instead the first target block group 10 is found to cover the HTML file title, the process goes to step S62 to diffuse and cover other target blocks (described with Fig. 6 below).
In step S56, the front border of the first target block group 10 (the unpaired <span> opening tag) is set as the second starting tag 102, and the search for an unpaired opening tag proceeds upward from it using the same stack method; the tag found is set as the front border of the second target block. As Fig. 5B shows, the first unpaired opening tag above the second starting tag 102 is <img>, but as explained above <img> is a functional tag that is not pushed onto the stack, so the search continues upward and finds the <div> opening tag, which is set as the front border of the second target block. Step S57 sets the rear border of the first target block group 10 (the unpaired </span> closing tag) as the third starting tag 103 and searches downward from it for the first unpaired closing tag; by the same method and rules the </div> closing tag is found and set as the rear border of the second target block. In step S58, the content enclosed between the front border and the rear border of the second target block is merged into the second target block group 20.
Once the second target block group 20, diffused outward from the first target block group 10, has been found, step S59 checks whether it covers the HTML file title. In this embodiment the content of the second target block group 20 does not cover the HTML file title, which means it can diffuse further outward. The setting of the smaller first target block group 10 is therefore deleted (in Fig. 5B the first target block group 10 is drawn with a dotted line to show that its setting has been deleted), the range of the second target block group 20 is retained, and the second target block group 20 is made the new first target block group (so only one first target block group, 20, remains in Fig. 5B). The process then returns to steps S54 and S55 to check whether target block group 20 covers the last target tag (step S54) and whether it covers the HTML file title (step S55), followed by the subsequent steps. Diffusion continues in this way until the content of the target block group covers the HTML file title or the last target tag.
The example of Fig. 5C shows another embodiment. After the first target block group 20 has been established by the above steps, diffusion continues outward and finds a second target block group 30. Step S59 determines that the content of this second target block group 30 covers the HTML file title, which means that the second target block group 30 extends beyond the required content. Step S61 therefore deletes the setting of the unsuitable second target block group 30 and retains the range of the first target block group 20 (in Fig. 5C the second target block group 30 is drawn with a dotted line to show that its setting has been deleted, leaving only the retained first target block group 20).
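The decision logic of steps S54 to S61 can be sketched in Python on a simplified model. This is only an illustration under stated assumptions: blocks are reduced to (start, end) character ranges, the successively larger enclosing ranges are assumed to have been computed already (innermost first), and the names `covers` and `diffuse` are hypothetical.

```python
def covers(block, pos):
    """True when character position pos lies inside the (start, end) range."""
    start, end = block
    return start <= pos < end

def diffuse(first_block, enclosing_blocks, title_pos, last_target_pos):
    """Grow the first target block group outward without covering the title.

    enclosing_blocks lists successively larger (start, end) ranges,
    innermost first (an assumption of this sketch).
    """
    current = first_block
    for candidate in enclosing_blocks:
        if covers(current, last_target_pos):  # S54: all important content found
            break
        if covers(current, title_pos):        # S55: cannot diffuse any further
            break
        if covers(candidate, title_pos):      # S59/S61: discard the larger group
            break
        current = candidate                   # larger group becomes the new first group
    return current
```

With a title at position 10 and a first block (30, 50), the candidate (20, 60) is accepted, while a further candidate (5, 90) is rejected because it would cover the title, mirroring the situations of Figs. 5B and 5C.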
Step S55 above checks whether the target block group covers the HTML file title; if it does, step S62 diffuses and covers other target blocks. Refer to the example of Fig. 6. After the preceding steps, the first target block group 41 has been diffused to its maximum range (further diffusion would cover the HTML file title, so the first target block group 41 cannot diffuse any further and represents the maximum range of the block content), and step S54 finds that the first target block group 41 does not yet cover the last target tag. Step S62 is then carried out. In step S62, the next opening tag (<div>) after the rear border (</div>) of the first target block group 41 is set as the fourth starting tag 421 and treated as the front border of a new target block 42; the search proceeds downward from the fourth starting tag 421 for the corresponding closing tag (the matching </div>), which is set as the rear border of the new target block 42. In step S63, the content enclosed between this front border and rear border is merged into a new target block group 42. Step S64 then checks whether the new target block group 42 covers the last target tag. If it does not, the process returns to step S62 to diffuse and cover another new target block group 43: the next opening tag (<div>) after the rear border (</div>) of target block group 42 is set as the fourth starting tag 431 and treated as the front border of another new target block 43, the corresponding closing tag (the matching </div>) is found downward from the fourth starting tag 431 and set as the rear border of the new target block 43, and in step S63 the content enclosed between them is merged into another new target block group 43. These steps repeat until the last target block group covers the last target tag. In this embodiment, the preceding steps separate out three target block groups 41, 42 and 43.
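Steps S62 to S64 can be sketched in Python as follows. This is an illustrative approximation, not the patented procedure: it matches an opening tag to its closing tag by counting nesting depth, which assumes well-formed markup in which every non-skipped opening tag has a closing tag, and the function names and sample HTML are hypothetical. As in the earlier assumption, "br" is added to the skipped tags.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*>")
SKIPPED = {"img", "meta", "input", "embed", "link", "param", "area", "hr", "col", "xml", "br"}

def next_sibling_block(html, after):
    """S62/S63: return (start, end) of the element opened by the first opening tag at or after `after`."""
    depth = 0
    start = None
    for m in TAG.finditer(html, after):
        if m.group(2).lower() in SKIPPED:
            continue
        if m.group(1) == "":            # opening tag
            if start is None:
                start = m.start()       # this plays the role of the fourth starting tag
            depth += 1
        elif start is not None:         # closing tag inside the new block
            depth -= 1
            if depth == 0:
                return start, m.end()
    return None

def split_siblings(html, first_block_end, last_target_pos):
    """Repeat S62 to S64 until a new group covers the last target tag."""
    groups = []
    cursor = first_block_end
    while True:
        block = next_sibling_block(html, cursor)
        if block is None:
            break
        groups.append(block)
        if block[0] <= last_target_pos < block[1]:  # S64: last target tag covered
            break
        cursor = block[1]
    return groups

# Hypothetical sample: a title, a first block, and two following sibling blocks.
html = '<h1>title</h1><div><p>A</p></div><div>ad</div><div>c<br>d</div>'
first_block_end = html.index("</div>") + len("</div>")
groups = split_siblings(html, first_block_end, html.rindex("<br>"))
sibling_texts = [html[s:e] for s, e in groups]
```

Together with the first block, the sample yields three target block groups, comparable to groups 41, 42 and 43 in Fig. 6.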
In step S64, if the content of the current target block group is found to cover the last target tag, all the important content has been found; no further diffusion is carried out and the process jumps to step S6 for the association value comparison (the same applies when step S54 finds that the content of the first target block group covers the last target tag).
Step S6 further includes the following steps:
S-61: Set a value N;
S-62: Check in sequence whether each target block group contains a target tag. If a target block group does not contain a target tag, increment the association value by 1 and continue with the next target block group; if it does contain a target tag, reset the association value to 0 and continue with the next target block group;
S-63: If the association value at the current target block group equals N, stop checking further target block groups and delete all target block groups below the current one;
S-64: Delete N target block groups upward, starting from the current target block group.
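Steps S-61 to S-64 can be sketched as a short Python function. The function name `filter_groups` and the representation of each target block group as a single boolean (whether it contains a target tag) are assumptions made for illustration.

```python
def filter_groups(has_target, n):
    """Return the indices of the target block groups that are kept.

    has_target[i] is True when group i contains a target tag (<p> or <br>);
    n is the distance threshold N of step S-61.
    """
    association = 0
    for i, contains in enumerate(has_target):
        if contains:
            association = 0          # S-62: reset on a group with a target tag
            continue
        association += 1             # S-62: one more group without a target tag
        if association == n:
            # S-63 drops every group below group i;
            # S-64 drops n groups upward starting from group i itself.
            return list(range(0, i - n + 1))
    return list(range(len(has_target)))  # threshold never reached: keep everything
```

For five groups where only the first and the last contain a target tag and N is 3, `filter_groups([True, False, False, False, True], 3)` keeps only index 0, which corresponds to retaining group 51 in the example of Fig. 7A.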
Refer to the example of Fig. 7A. For ease of explanation, Fig. 7A assumes that the preceding steps have separated the HTML file paragraph into five target block groups 51, 52, 53, 54 and 55. Although step S3 recorded all the target tags in the HTML file paragraph together with their character positions, the target tags may lie far apart. In web design practice, and according to statistics gathered by the inventors, target tags that carry important body text are not separated by large gaps; content carried by a target tag that lies too far away is almost always unimportant (for example, reader comments at the bottom of the page). The figure shows that the first target block group 51 covers a <p> target tag and the fifth target block group 55 covers a <br> target tag, but the two groups are rather far apart. Step D therefore sets an association value and checks each target block group in sequence, and step S-61 sets a value N that serves as the distance threshold between target block groups. In this embodiment N is set to 3, and all target block groups for which the association value meets the set condition are deleted: they are judged to lie too far from a valid target block group that contains a target tag, and so to hold unimportant content. Although the fifth target block group 55 in Fig. 7A does cover a target tag, it lies beyond the distance threshold N, so it is judged unrelated to the important body text and deleted, as explained in detail below.
After the association value has been set, target block groups 51, 52, 53, 54 and 55 are checked in sequence for whether they contain a target tag (a <p> or <br> tag). If the group being checked does not contain a target tag, the association value is incremented by 1 and the next group is checked; if it does, the association value is reset to 0 and the next group is checked. In the example of Fig. 7A, the first target block group 51 contains a <p> target tag, so the association value is set to 0. The second target block group 52 contains no target tag, so the association value becomes 1. The third target block group 53 contains no target tag, so the association value becomes 2. The fourth target block group 54 contains no target tag, so the association value becomes 3, which reaches the set value N. Checking therefore stops before target block group 55, which is judged to lie too far from the valid target block group 51 and to hold unimportant content, so every target block group below the current group 54, here group 55, is deleted. In step S-64, three target block groups are then deleted upward starting from the current group 54 (because N is 3), that is, groups 52, 53 and 54. At this point target block groups 52, 53, 54 and 55, all judged unimportant, have been deleted (drawn with dotted lines in the figure), and only target block group 51, judged to hold the important body text, is retained.
Step S7 then outputs the content of the remaining target block group 51 as the required file. In the example of Fig. 7A, the final output is the text carried by the <p> target tag in target block group 51. If target block group 51 also contains <img> picture tags, <a href> hyperlink tags or other tag content, these are output to the required file as well, so the output includes both the plain text of the important body text and other useful information such as pictures and hyperlinks (the final output is shown in Fig. 7B). In this way the required content of the HTML file is extracted accurately.

Claims (6)

1. A method for capturing the content of a HyperText Markup Language file, in which an HTML file is first obtained and a tag preprocessing procedure is executed to capture an HTML file paragraph related to the main content, the HTML file paragraph containing at least one target tag and the content enclosed by the target tag, characterized in that the method carries out the following steps on the HTML file paragraph:
A. Search the HTML file paragraph for all target tags, and record the character position information of those target tags in a data structure;
B. Find the character positions of the first target tag and the last target tag in the HTML file paragraph according to the information recorded in the data structure;
C. Set the first target tag as a first starting tag and carry out a paragraph segmentation step until the last target tag is covered, thereby separating out at least one target block group;
D. Set an association value, compare the target block groups in sequence against the target tags and record the association value, and delete the target block groups whose association value meets a set condition; and
E. Output the content of the remaining target block groups as the required file.
2. the method for the super word tag language file content of acquisition as claimed in claim 1, it is characterized in that, this object label comprises label <p> and label <br>.
3. The method for capturing hypertext markup language file content as claimed in claim 1, characterized in that the paragraph separation step in step C comprises:
C-1. searching upward from the first starting-point tag for the unpaired opening tag, and setting it as a first target block front boundary;
C-2. searching downward from the first starting-point tag for the unpaired closing tag, and setting it as a first target block rear boundary;
C-3. merging the content enclosed between the first target block front boundary and the first target block rear boundary into a first target block group;
C-4. if the content of the first target block group already covers the last target tag, proceeding to step D; if it does not yet cover the last target tag, proceeding to the next step;
C-5. if the content of the first target block group already covers the HTML file header, proceeding to step C-11; if it does not yet cover the HTML file header, proceeding to the next step;
C-6. setting the first target block front boundary as a second starting-point tag, searching upward from the second starting-point tag for the unpaired opening tag, and setting it as a second target block front boundary;
C-7. setting the first target block rear boundary as a third starting-point tag, searching downward from the third starting-point tag for the unpaired closing tag, and setting it as a second target block rear boundary;
C-8. merging the content enclosed between the second target block front boundary and the second target block rear boundary into a second target block group;
C-9. if the content of the second target block group does not cover the HTML file header, discarding the first target block group, setting the second target block group as the new first target block group, and returning to step C-4;
C-10. if the content of the second target block group covers the HTML file header, discarding the second target block group, retaining the content of the first target block group, and returning to step C-4;
C-11. setting the next opening tag after the rear boundary of the current target block as a fourth starting-point tag, treating it as the front boundary of another new target block, searching downward from the fourth starting-point tag for the corresponding closing tag, and setting that closing tag as the rear boundary of the new target block;
C-12. merging the content enclosed between the front boundary and the rear boundary of the new target block into another new target block group; and
C-13. if the content of the new target block group does not cover the last target tag, returning to step C-11, thereby separating out several target block groups until the last target tag in the HTML file is covered by one of the target block groups.
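As an illustration only, steps C-1 to C-10 can be sketched in Python as follows. This is not the applicant's implementation. The names `tokenize`, `enclose`, `covers_header` and `first_block_group` are hypothetical, the input is assumed to be well-formed HTML with explicit closing tags, void tags such as <br> are skipped during pairing, and the "covers the HTML file title/header" test is read, as an assumption, to mean that the block contains the `<head` element. Steps C-11 to C-13 are not covered by the sketch.

```python
import re
from typing import List, Optional, Tuple

# Assumption: tags are matched with a simple regex instead of a full HTML parser.
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*?(/?)>")
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}  # never paired

Token = Tuple[int, int, str]  # (start offset, end offset, "open" or "close")
Span = Tuple[int, int]        # (offset of front boundary, offset past rear boundary)


def tokenize(html: str) -> List[Token]:
    """List every pairable tag; void and self-closing tags are skipped."""
    tokens = []
    for m in TAG_RE.finditer(html):
        if m.group(2).lower() in VOID_TAGS or m.group(3):
            continue
        tokens.append((m.start(), m.end(), "close" if m.group(1) else "open"))
    return tokens


def enclose(tokens: List[Token], span: Span) -> Optional[Span]:
    """Find the unpaired opening tag above and unpaired closing tag below span."""
    front = back = None
    depth = 0
    for start, end, kind in reversed(tokens):  # C-1 / C-6: search upward
        if end > span[0]:
            continue
        if kind == "close":
            depth += 1          # a complete element lies above; skip its opener
        elif depth == 0:
            front = start       # unpaired opening tag found
            break
        else:
            depth -= 1
    depth = 0
    for start, end, kind in tokens:            # C-2 / C-7: search downward
        if start < span[1]:
            continue
        if kind == "open":
            depth += 1          # a complete element lies below; skip its closer
        elif depth == 0:
            back = end          # unpaired closing tag found
            break
        else:
            depth -= 1
    if front is None or back is None:
        return None
    return (front, back)


def covers_header(html: str, span: Span) -> bool:
    """Assumed reading of the header check: the block contains <head ...>."""
    return re.search(r"<head\b", html[span[0]:span[1]], re.IGNORECASE) is not None


def first_block_group(html: str, first_pos: int, last_pos: int) -> Optional[Span]:
    """Grow the first target block group outward (C-1 to C-10)."""
    tokens = tokenize(html)
    block = enclose(tokens, (first_pos, first_pos))        # C-1 to C-3
    while (block is not None
           and not (block[0] <= last_pos < block[1])       # C-4
           and not covers_header(html, block)):            # C-5
        candidate = enclose(tokens, block)                 # C-6 to C-8
        if candidate is None or covers_header(html, candidate):
            break                                          # C-10: keep old block
        block = candidate                                  # C-9: adopt new block
    return block


page = ('<html><head><title>T</title></head><body><div id="nav">menu</div>'
        '<div id="main"><p>one</p><p>two</p></div></body></html>')
span = first_block_group(page, page.index("<p>"), page.rindex("<p>"))
print(page[span[0]:span[1]])
```

In this sketch the block widens by one enclosing element per iteration and stops either when the last target tag falls inside it or when the next widening would take in the header.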
4. The method for capturing hypertext markup language file content as claimed in claim 1, characterized in that step D further comprises:
D-1. setting a value N;
D-2. checking in sequence whether each target block group contains the target tag; if a target block group does not contain the target tag, incrementing the relevance value by 1 and continuing with the next target block group; if a target block group contains the target tag, resetting the relevance value to 0 and continuing with the next target block group;
D-3. if the relevance value at the current target block group equals N, stopping the check of further target block groups and deleting all target block groups below the current target block group; and
D-4. deleting N target block groups upward, starting from the current target block group.
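As an illustration only, steps D-1 to D-4 can be sketched in Python as a run-length counter over the block groups. The function name `prune_block_groups` and the regex used to detect target tags are hypothetical. The sketch assumes that the N groups deleted upward in D-4 include the current group, so that exactly the N consecutive groups without a target tag, and everything below them, are removed.

```python
import re
from typing import List, Sequence

# Target tags are assumed to be <p> and <br>, matched case-insensitively.
TARGET_TAG_RE = re.compile(r"<(p|br)\b[^>]*>", re.IGNORECASE)


def prune_block_groups(groups: Sequence[str], n: int = 3) -> List[str]:
    """Return the block groups that survive the relevance comparison."""
    relevance = 0                       # the running relevance value of step D
    for i, group in enumerate(groups):
        if TARGET_TAG_RE.search(group):
            relevance = 0               # D-2: group has a target tag, reset
        else:
            relevance += 1              # D-2: group lacks a target tag, count it
            if relevance == n:          # D-3: stop checking here
                # D-3 drops everything below index i; D-4 drops the n groups
                # ending at index i (assumed to include group i itself).
                return list(groups[: i - n + 1])
    return list(groups)                 # the threshold was never reached


blocks = ["<p>intro</p>", "<div>figure</div>", "<p>body</p>",
          "<div>ad 1</div>", "<div>ad 2</div>", "<div>ad 3</div>",
          "<p>footer text</p>"]
print(prune_block_groups(blocks, n=3))
```

A single group without a target tag (such as the figure block above) survives, because the counter is reset by the next group that contains a target tag.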
5. The method for capturing hypertext markup language file content as claimed in claim 4, characterized in that the value N is 3.
6. The method for capturing hypertext markup language file content as claimed in claim 1, characterized in that the relevant information stored in the data structure comprises: the information of each target tag and the character position at which that target tag occurs.
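As an illustration only, the data structure of claim 6 and the lookup of step B can be sketched in Python as follows. The names `TagRecord`, `record_target_tags` and `first_and_last` are hypothetical, and the target tags are assumed to be <p> and <br> as in claim 2.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Matches <p ...>, <br>, <br/> and similar, but not <pre> or <param>.
TARGET_TAG_RE = re.compile(r"<(p|br)\b[^>]*>", re.IGNORECASE)


@dataclass
class TagRecord:
    name: str       # information about the target tag: its lower-cased name
    position: int   # character offset of the tag's "<" in the HTML text


def record_target_tags(html: str) -> List[TagRecord]:
    """Record every target tag together with its character position."""
    return [TagRecord(m.group(1).lower(), m.start())
            for m in TARGET_TAG_RE.finditer(html)]


def first_and_last(records: List[TagRecord]) -> Optional[Tuple[int, int]]:
    """Step B: positions of the first and the last recorded target tag."""
    if not records:
        return None
    return records[0].position, records[-1].position


sample = "<html><body><div><p>Hi</p><pre>x</pre>a<br/>b</div></body></html>"
records = record_target_tags(sample)
print(records)
print(first_and_last(records))
```

Because the records are produced in document order, the first and last entries give the two positions that step C needs without a second scan of the file.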
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410003176.2A CN104765737A (en) 2014-01-03 2014-01-03 Method for capturing HTML (HyperText markup Language) contents

Publications (1)

Publication Number Publication Date
CN104765737A true CN104765737A (en) 2015-07-08

Family

ID=53647572

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410003176.2A Pending CN104765737A (en) 2014-01-03 2014-01-03 Method for capturing HTML (HyperText markup Language) contents

Country Status (1)

Country Link
CN (1) CN104765737A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101246481A (en) * 2007-02-16 2008-08-20 易搜比控股公司 Method and system for converting ultra-word indicating language web page into pure words
CN101751403A (en) * 2008-12-11 2010-06-23 易搜比控股公司 Method for transforming hypertext tag language file to text file
CN102420842A (en) * 2010-09-28 2012-04-18 腾讯科技(深圳)有限公司 Method and system for sending webpage in mobile network
US20120254190A1 (en) * 2011-03-31 2012-10-04 Fujitsu Limited Extracting method, computer product, extracting system, information generating method, and information contents

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150708
