
CN104765737A - Method for capturing HTML (HyperText Markup Language) content

Info

Publication number: CN104765737A
Application number: CN201410003176.2A
Authority: CN
Inventors: 蔡弘扬; 洪启豪; 谷鸿祥
Original assignee: Esobi Inc
Current assignee: Esobi Inc
Priority date: 2014-01-03
Filing date: 2014-01-03
Publication date: 2015-07-08

Abstract

The invention discloses a method for capturing the content of an HTML (HyperText Markup Language) file. The method records the character positions of all target tags in the retained HTML file paragraph; takes the first target tag as a first starting tag and then sets several different subsequent starting tags, performing a paragraph-separation step from each of them, on the principle that a block must not extend past the HTML file title and that the last target tag must eventually be enclosed, so as to separate out at least one target block group; compares the target block groups with the target tags for relevance in sequence and deletes the unimportant target block groups; and outputs the content of the remaining target block groups. In this way, information comprising the important body text and other required items (for example, pictures and hyperlinks associated with the important body text) is accurately extracted from an HTML file.

Description

Method for capturing the content of a HyperText Markup Language file

Technical field

The present invention relates to a method for extracting web page content, and in particular to a method that can extract, from a HyperText Markup Language (HTML) file, information comprising the important body text and other required items (such as pictures and hyperlinks related to the important body text).

Background art

Existing techniques for converting an HTML file into plain-text content focus only on how to extract the important body text. Their aims are to avoid extracting the text of unimportant junk content and to improve the accuracy of the plain-text result, but they often ignore the pictures and hyperlinks related to the important body text. As a result, the reader frequently cannot see the related information and may not even be able to understand what the body text is saying about a picture.

The inventors previously disclosed, and were granted a patent on, a method for converting a HyperText Markup Language (HTML) file into a plain-text file. That method captures important body-text paragraphs by means of target tags such as <p> and <br>, uses a preset sentence index value as the basis for separating paragraphs, divides the retained HTML file paragraph into several target blocks, and then outputs as plain text the target block whose meaning is closest to the HTML file title. Although this improves the accuracy of extracting the important body text, in special cases any paragraph related to the file title is treated as important and captured. For example, if a reader comment that follows the important body text mentions the HTML file title, that comment is extracted together with the body text as though it were important, which produces a content-extraction error. Moreover, that method cannot extract the pictures or hyperlinks related to the important content, which is unfortunate.

Another known approach parses the tree-shaped tag structure of the HTML. Although it can extract the pictures and hyperlinks related to the important body text, it must first convert the entire HTML file into a tree structure and then take the content of one or several nodes as the important body text. This approach can only run in a specific environment capable of parsing the whole HTML file, so the process is both restricted and time-consuming, and the judgment of which node holds the important body text is often wrong. Furthermore, if the important body text is divided into several paragraphs that fall into different nodes, the important body text in the other nodes is very easily missed.

Summary of the invention

To solve the above problems, the main object of the present invention is to provide a method for extracting required content from a HyperText Markup Language (HTML) file, which can accurately extract from the HTML file information comprising the important body text and other required items (such as pictures and hyperlinks related to the important body text), for the user's convenient reading.

To achieve the above object, the method of the present invention first obtains an HTML file and performs a preliminary tag-processing procedure to capture an HTML file paragraph related to the main content. The HTML file paragraph contains the content enclosed by at least one <p> tag or <br> tag; the <p> tag and the <br> tag are the target tags. The following steps are then carried out on this HTML file paragraph:

A. search the HTML file paragraph for all target tags, and record the character position information of those target tags in a data structure;

B. find the positions of the first target tag and the last target tag in the HTML file paragraph according to the information recorded in the data structure;

C. set the first target tag as a first starting tag and perform a paragraph-separation step until the last target tag is enclosed, thereby separating out at least one target block group;

D. set a relevance value, compare the target block group(s) with the target tag for relevance in sequence and record the relevance value, and delete the target block group(s) whose relevance value reaches a set condition;

E. output the content of the remaining target block group(s) as the required file.

The paragraph-separation step in step C performs a continuous, outward-expanding enclosing operation that forms different target block groups, on the principle that a block must not extend past the HTML file title and that the last target tag must be enclosed. Expansion starts from the first target tag as the first starting tag and encloses a first target block group. If that first target block group has not yet enclosed the last target tag and its range does not extend past the HTML file title, different starting tags are then set and the expanding enclosure is carried out from each of them, enclosing several different target block groups, until the last target tag is enclosed by one of the target block groups.

After all target block groups in the HTML file paragraph have been enclosed, step D sets a relevance value and compares the target block groups, in sequence, with the target tag (<p> tag or <br> tag). If a target block group does not contain a target tag, the relevance value is incremented by 1 and the next target block group is checked; if it does contain a target tag, the relevance value is reset to 0 and the next target block group is checked. The detailed procedure of step D also sets an N value, which serves as a distance threshold. When the relevance value reaches N at the target block group being checked, checking stops; all target block groups below the current one are deleted, and N target block groups are deleted starting from the current one and counting upward (all deleted target block groups are regarded as unimportant and discarded). Finally, only the content of the retained target block groups is output, which achieves the object of accurately extracting the required content from the HTML file.

The above description of the content of the present invention and the following description of the embodiments serve to demonstrate and explain the spirit and principle of the present invention, and to provide further explanation of the claims of the present invention.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of a preferred embodiment of the present invention.

Fig. 2 is a schematic diagram of the data structure of the preferred embodiment of the present invention.

Fig. 3 is a flow chart of part of the steps of the preferred embodiment of the present invention.

Fig. 4 is a flow chart of part of the steps of the preferred embodiment of the present invention.

Figs. 5A, 5B and 5C are schematic examples of the preferred embodiment of the present invention.

Fig. 6 is a schematic example of the preferred embodiment of the present invention.

Figs. 7A and 7B are schematic examples of the preferred embodiment of the present invention.

Reference numerals:

10 first target block group

101 first starting tag

102 second starting tag

103 third starting tag

20 first/second target block group

201 second starting tag

202 third starting tag

30 second target block group

41-43 target block groups

421 fourth tag

431 fourth tag

51-55 target block groups

Embodiment

The features and embodiments of the present invention are described in detail below with reference to the accompanying drawings and a preferred embodiment.

Fig. 1 shows the flow chart of the steps of a preferred embodiment of the present invention, comprising:

S1: obtain an HTML file;

S2: perform preliminary tag processing to capture an HTML file paragraph related to the main content, the HTML file paragraph containing at least one target tag and the content it encloses;

S3: search the HTML file paragraph for all target tags, and record the character position information of those target tags in a data structure;

S4: find the positions of the first target tag and the last target tag in the HTML file paragraph according to the information recorded in the data structure;

S5: set the first target tag as the first starting tag and perform the paragraph-separation step until the last target tag is enclosed, thereby separating out at least one target block group;

S6: set a relevance value, compare all target block groups with the target tag for relevance in sequence and record the relevance value, and delete the target block groups whose relevance value reaches the set condition;

S7: output the content of the remaining target block groups as the required file.

Steps S1 and S2 mainly serve to delete in advance the large amount of unneeded noise (in the HTML source code) and the unused tags in the HTML file; these different tags each have their own function. An HTML file is often very long, but the important content appears in only a small part of it. The present invention therefore first targets the tag ranges in which important content cannot appear and substantially trims the content of the HTML file so as to retain the useful HTML file paragraph. This is a necessary preliminary operation for capturing the important content of an HTML file.

The present invention then extracts the useful content mainly from the retained HTML file paragraph. In step S3, all target tags are first searched for in the HTML file paragraph. The target tags are mainly the <p> tag and the <br> tag; generally speaking, the truly important content of an HTML file almost always appears near a <p> tag or a <br> tag. The present invention therefore first records the character position information of all target tags in the HTML file paragraph in a data structure, a schematic diagram of which is shown in Fig. 2. The recorded information comprises the information of each <p> tag and <br> tag and the character position at which each <p> tag and <br> tag appears in the HTML file paragraph. In step S4, the character positions of the first target tag and the last target tag in the HTML file paragraph can thus be found from the information recorded in the data structure.
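Steps S3 and S4 can be sketched as follows. This is only a minimal illustration under stated assumptions: the patent does not specify how tags are located or how the data structure of Fig. 2 is laid out, so the regular expression, the function name `record_target_tags` and the list-of-tuples layout are all hypothetical.

```python
import re

# Assumption: a target tag is an opening <p ...> or <br ...> tag, matched
# case-insensitively. The \b keeps <param> or <pre> from matching.
TARGET_TAG = re.compile(r"<(p|br)\b[^>]*>", re.IGNORECASE)


def record_target_tags(fragment):
    """Return (tag_name, character_position) for every target tag (step S3)."""
    return [(m.group(1).lower(), m.start()) for m in TARGET_TAG.finditer(fragment)]


fragment = "<div><span><p>Body text</p></span><br/>More text</div>"
positions = record_target_tags(fragment)

# Step S4: the first and last recorded entries give the two positions needed later.
first_target, last_target = positions[0], positions[-1]
print(positions)  # [('p', 11), ('br', 34)]
```

Keeping the positions in document order means the first and last target tags are simply the two ends of the list.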

From step S5 onward, the paragraph-separation step of the present invention is carried out. The first target tag that appears in the HTML file paragraph is set as the first starting tag; that is, paragraph separation begins from the first <p> tag or <br> tag that appears in the HTML file paragraph. The paragraph-separation step works mainly by pushing the tags in the HTML file paragraph onto a stack and expanding the enclosure outward from the set starting tag, so that the disorderly HTML file paragraph is divided into at least one target block group, after which it is judged whether each target block group belongs to the important content.

Please refer here to the flow of steps in Fig. 3 together with the examples of Figs. 5A to 5C. Step S51 searches upward from the first starting tag for an unpaired opening tag and sets it as the front border of a first target block. In the example of Fig. 5A, step S51 first finds the set first starting tag, namely the first <p> tag or <br> tag to appear in the HTML file paragraph. As can be clearly seen in Fig. 5A, the first target tag to appear in this HTML file paragraph is a <p> tag, so in this embodiment that first <p> tag is set as the first starting tag 101. An unpaired opening tag is then sought upward from the first starting tag 101; that is, starting from this <p> tag, the tags are pushed onto a stack in the upward direction, each tag that appears before the <p> being pushed in turn. Following the last-in, first-out principle of the stack, whenever a matching pair of opening and closing tags has been pushed onto the stack, that pair is popped from the stack, whereas an unpaired opening or closing tag remains in the stack. An unpaired tag can therefore be found very easily during expansion and is set as the border of the current block expansion. During this stack processing, certain functional tags that have only an opening tag are not pushed onto the stack for checking, for example <img>, <meta>, <input>, <embed>, <link>, <param>, <area>, <hr>, <col> and <xml>, which prevents the range of the target block group from being captured incorrectly.
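The upward stack scan of step S51 might look like the sketch below. It is an illustration under assumptions, not the patented implementation: the tag-matching regular expression and the name `find_unpaired_opening` are hypothetical, and malformed nesting is not handled. The skipped tag names are the ones listed in the description.

```python
import re

# Group 1 is "/" for a closing tag; group 2 is the tag name.
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")

# Tags the description lists as having only an opening tag; never pushed.
SKIPPED = {"img", "meta", "input", "embed", "link", "param", "area", "hr", "col", "xml"}


def find_unpaired_opening(fragment, start_pos):
    """Scan upward from start_pos; return the match for the first opening tag
    that has no closing partner in the scanned span, or None."""
    stack = []  # closing tags still waiting for their opening partner
    tags_before = [m for m in TAG.finditer(fragment) if m.end() <= start_pos]
    for m in reversed(tags_before):
        is_closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name in SKIPPED:
            continue
        if is_closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()  # a matched pair leaves the stack
        else:
            return m  # unpaired opening tag: the front border
    return None


fragment = '<div><img src="a.png"><span><b>x</b><p>Body</p></span></div>'
start = fragment.index("<p>")  # position of the first starting tag
border = find_unpaired_opening(fragment, start)
print(border.group(0))  # <span>
```

Here the `<b>x</b>` pair cancels out on the stack, so the scan stops at `<span>`, mirroring the Fig. 5A example in which `<span>` becomes the front border.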

In the example of Fig. 5A, the first unpaired opening tag that step S51 finds upward from the first starting tag <p> is the <span> opening tag, so this <span> tag is set as the front border of the first target block. Step S52 then searches downward from the first starting tag <p> for the first unpaired closing tag, which is the </span> closing tag, and sets it as the rear border of the first target block. In step S53, the content enclosed between the front border and the rear border of the first target block is merged into the first target block group 10.
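Steps S52 and S53 mirror the upward scan in the downward direction. The sketch below assumes the front border has already been located (here it is simply looked up with `str.index`), and the name `find_unpaired_closing` is hypothetical. Adding `br` to the skipped set is an assumption of this sketch, since a `<br>` met while scanning downward has no closing partner; the description's own list does not include it.

```python
import re

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")

# The description's list of opening-only tags, plus "br" (an assumption here).
SKIPPED = {"img", "meta", "input", "embed", "link", "param", "area", "hr",
           "col", "xml", "br"}


def find_unpaired_closing(fragment, start_pos):
    """Scan downward from start_pos; return the match for the first closing tag
    whose opening partner lies outside the scanned span, or None."""
    stack = []  # opening tags still waiting for their closing partner
    for m in TAG.finditer(fragment, start_pos):
        is_closing = m.group(1) == "/"
        name = m.group(2).lower()
        if name in SKIPPED:
            continue
        if not is_closing:
            stack.append(name)
        elif stack and stack[-1] == name:
            stack.pop()
        else:
            return m  # unpaired closing tag: the rear border
    return None


fragment = "<div><span><p>Body</p><br>more</span></div>"
start = fragment.index("<p>")        # first starting tag
front = fragment.index("<span>")     # front border, assumed found by step S51
rear = find_unpaired_closing(fragment, start)

# Step S53: everything between the two borders forms the first target block group.
first_block = fragment[front:rear.end()]
print(first_block)  # <span><p>Body</p><br>more</span>
```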

Once the target block group 10 has been found, it is next judged whether this target block group 10 encloses the last target tag (step S54) and whether it encloses the HTML file title (step S55). In step S54, if the content of the first target block group 10 already encloses the last target tag, all important content has been found, and the flow jumps to step S6 for relevance-value comparison (described under step S6 below). In this embodiment the content of the first target block group 10 does not yet enclose the last target tag, so step S55 is carried out to judge whether the HTML file title is enclosed (the HTML file title is the <h1>title</h1> in the example of Fig. 5A). In this embodiment the content of the first target block group 10 does not yet enclose the HTML file title, which means that the first target block group 10 can expand outward again, so the outward expansion of step S56 follows. If, on the other hand, the first target block group 10 is found to enclose the HTML file title already, the expanding enclosure of other target blocks in step S62 is carried out (described with Fig. 6 below).

In step S56, the front border of the first target block 10 (the unpaired opening tag <span>) is set as the second starting tag 102, and an unpaired opening tag is again sought upward from the second starting tag 102 by the same stack method and set as the front border of the second target block. As Fig. 5B shows, the first tag encountered upward from the second starting tag 102 is <img>; as explained above, <img> is a functional tag with only an opening tag, so it is not pushed onto the stack for checking. Searching further upward finds the <div> opening tag, so this <div> tag is set as the front border of the second target block. Step S57 then sets the rear border of the first target block 10 (the unpaired closing tag </span>) as the third starting tag 103 and searches downward from the third starting tag 103 for the first unpaired closing tag; by the same method and rules the </div> closing tag is found and set as the rear border of the second target block. In step S58, the content enclosed between the front border and the rear border of the second target block is merged into the second target block group 20.

Once the second target block group 20, expanded outward from the first target block group 10, has been found, it is first judged whether the second target block group 20 encloses the HTML file title (step S59). In this embodiment the content of the second target block group 20 does not yet enclose the HTML file title, which means that the second target block group 20 can expand outward again. The setting of the smaller first target block group 10 is therefore deleted (in the example of Fig. 5B, the first target block group 10 is drawn with a dotted line to show that the block setting has been deleted), the range of the second target block group 20 is retained, and the second target block group 20 is reset as the new first target block group (so only one first target block group 20 remains in Fig. 5B). The flow then returns to steps S54 and S55 to judge whether this target block group 20 encloses the last target tag (step S54) and whether it encloses the HTML file title (step S55), and to carry out the subsequent steps, so that the range keeps expanding until the content of the target block group encloses the HTML file title or the last target tag. The example of Fig. 5C shows another embodiment: after the first target block group 20 has been established through the above steps and conditions, expansion continues outward and finds a second target block group 30. Step S59 now judges that the content of this second target block group 30 encloses the HTML file title, which means that the second target block group 30 extends beyond the required content. Step S61 is then carried out to delete the setting of the unsuitable second target block group 30 and retain the range of the first target block group 20 (in the example of Fig. 5C, the second target block group 30 is drawn with a dotted line to show that the block setting has been deleted, leaving only the retained first target block group 20).
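The decision logic of steps S54 to S61 can be isolated from the tag scanning, as in the following sketch. It makes several simplifying assumptions that are not in the patent: the successive candidate blocks are supplied up front as hypothetical `(start, end)` character spans, innermost first, rather than being found on demand by the stack scans; the innermost candidate is assumed not to enclose the title; and the title and the last target tag are each represented by a single character position.

```python
def expand_first_block(candidates, title_pos, last_target_pos):
    """Pick the widest candidate span that does not enclose the title.

    candidates: (start, end) spans, innermost first, each enclosing the previous.
    """
    def encloses(span, pos):
        return span[0] <= pos < span[1]

    current = candidates[0]
    for wider in candidates[1:]:
        if encloses(current, last_target_pos):
            break  # S54: the last target tag is enclosed, nothing more to find
        if encloses(wider, title_pos):
            break  # S59/S61: the wider block reaches the title, so discard it
        current = wider  # S59: no title enclosed, the wider block replaces the smaller
    return current


# Three nested candidates; the title sits at position 10, the last target tag at 100.
spans = [(40, 60), (30, 80), (5, 120)]
chosen = expand_first_block(spans, title_pos=10, last_target_pos=100)
print(chosen)  # (30, 80)
```

In this example the outermost span is rejected because it would enclose the title, which corresponds to the situation of Fig. 5C; the last target tag at position 100 is still outside the chosen span, so the flow would continue with the sibling-block step described next.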

In step S55 above, it is judged whether the target block group encloses the HTML file title; if it does, the expanding enclosure of other target blocks in step S62 is carried out. Please refer to the example of Fig. 6. As shown in Fig. 6, the preceding steps have expanded and enclosed a first target block group 41 of maximum range (any further expansion would enclose the HTML file title, so the first target block group 41 cannot expand again and represents the maximum range of the block content), and the judgment of step S54 finds that this first target block group 41 does not yet enclose the last target tag. The expanding enclosure of other target blocks in step S62 is then carried out. In step S62, the next opening tag (i.e. <div>) after the rear border (i.e. </div>) of the first target block group 41 is set as the fourth tag 421 and treated as the front border of another new target block 42; the corresponding closing tag is then sought downward from this fourth tag 421 (the corresponding </div> closing tag is found) and set as the rear border of the new target block 42. In step S63, the content enclosed between this front border and rear border is merged into another new target block group 42. After the new target block group 42 has been enclosed, step S64 checks whether the new target block group 42 encloses the last target tag. If it does not, the flow returns to step S62 to enclose yet another new target block group 43: the next opening tag (i.e. <div>) after the rear border (i.e. </div>) of the new target block group 42 is set as the fourth tag 431 and treated as the front border of another new target block 43; the corresponding closing tag is sought downward from this fourth tag 431 (the corresponding </div> closing tag is found) and set as the rear border of the new target block 43; and in step S63 the content enclosed between this front border and rear border is merged into another new target block group 43. These steps are repeated until the last target block group encloses the last target tag. In this embodiment, the preceding steps thus separate and enclose three target block groups 41, 42 and 43.
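The loop of steps S62 to S64 can be sketched as below. This is an illustration under assumptions: the patent does not say how the "corresponding closing tag" is found, so simple depth counting on well-formed markup is used here, and the names `next_sibling_block` and `split_sibling_blocks` are hypothetical. Treating `br` as an opening-only tag is likewise an assumption.

```python
import re

TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)\b[^>]*>")
OPENING_ONLY = {"img", "meta", "input", "embed", "link", "param", "area", "hr",
                "col", "xml", "br"}


def next_sibling_block(fragment, after_pos):
    """Return the (start, end) span from the next opening tag after after_pos
    to its corresponding closing tag, or None if there is no such element."""
    depth, start = 0, None
    for m in TAG.finditer(fragment, after_pos):
        if m.group(2).lower() in OPENING_ONLY:
            continue
        if m.group(1) == "/":
            if start is None:
                return None  # a closing tag came first: no further sibling
            depth -= 1
            if depth == 0:
                return (start, m.end())
        else:
            if start is None:
                start = m.start()  # S62: front border of the new block
            depth += 1
    return None


def split_sibling_blocks(fragment, first_block, last_target_pos):
    """Append sibling blocks until one encloses the last target tag (S62-S64)."""
    blocks = [first_block]
    while not (blocks[-1][0] <= last_target_pos < blocks[-1][1]):
        nxt = next_sibling_block(fragment, blocks[-1][1])
        if nxt is None:
            break
        blocks.append(nxt)
    return blocks


fragment = "<div><p>A</p></div><div>ad</div><div>reply<br></div>"
first_block = (0, fragment.index("</div>") + len("</div>"))
blocks = split_sibling_blocks(fragment, first_block, fragment.rindex("<br>"))
print([fragment[s:e] for s, e in blocks])
```

With this input the loop yields three block groups, analogous to groups 41, 42 and 43 in Fig. 6, and stops once the block holding the final `<br>` has been enclosed.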

In step S64, if the target block group being checked is found to enclose the last target tag, all important content has been found; no further new target block group is enclosed, and the flow jumps to step S6 for relevance-value comparison (the same applies when step S54 finds that the content of the first target block group already encloses the last target tag).

Step S6 further comprises the following steps:

S-61: set an N value;

S-62: check in sequence whether each target block group contains the target tag; if a target block group does not contain the target tag, increment the relevance value by 1 and continue with the next target block group; if it does contain the target tag, reset the relevance value to 0 and continue with the next target block group;

S-63: if the relevance value at the current target block group equals the N value, stop checking further target block groups and delete all target block groups below the current one;

S-64: delete N target block groups, starting from the current target block group and counting upward.
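Steps S-61 to S-64 amount to a single pass with a counter. The following is a minimal sketch; the function name `filter_by_relevance` and the use of a list of booleans to represent the block groups are assumptions made for illustration.

```python
def filter_by_relevance(has_target, n=3):
    """Return the indices of the target block groups that are kept.

    has_target[i] is True when block group i contains a <p> or <br> tag.
    n is the N value of step S-61 (3 in the described embodiment).
    """
    relevance = 0
    for i, contains in enumerate(has_target):
        # S-62: reset on a block with a target tag, otherwise count up.
        relevance = 0 if contains else relevance + 1
        if relevance == n:
            # S-63 drops every block below i; S-64 drops n blocks counting
            # upward from i (i itself included), so blocks 0 .. i-n survive.
            return list(range(0, i - n + 1))
    return list(range(len(has_target)))  # threshold never reached: keep all


# Five block groups: only the first and the last contain a target tag.
kept = filter_by_relevance([True, False, False, False, True], n=3)
print(kept)  # [0]
```

The sample input corresponds to the five block groups 51 to 55 discussed with Fig. 7A: the counter reaches 3 at the fourth group, so only the first group survives even though the fifth also holds a target tag.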

Please also refer to the example schematic of Fig. 7 A, for convenience of explaining orally, Fig. 7 A supposes via after abovementioned steps, separated in html file paragraph coated go out 5 target block groups 51, 52, 53, 54, 55, although in abovementioned steps S3, we first the information of all object labels in html file paragraph and in html file paragraph character position all have recorded, but the distance of each object label may be far apart, statistics in webpage design practice and inventor is surveyed, between the object label carrying important interior civilian information secretly all can not interval too far away, content apart from object label entrained with too far away is nearly all the information (as the reader bottom webpage responds and so on) of literary composition in insignificant, by signal can in can know and find out, first aim block 51 is coated with object label <p>, and the just coated object label </br> of the 5th target block 55, but the spacing of two objects block is somewhat far away, so in step D, set a relating value and sequentially check each target block, a N value is set again in step S-61, this N value is namely as the judgement of distance threshold between preceding aim block group, in this embodiment, N value is set as 3 by us, and relating value is reached all target block groups imposed a condition and delete and (judge those target block groups and comprise the effective target block group of object tag information at a distance of too far away, belong to the target block group of literary composition in insignificant, although the 5th target block 55 in Fig. 7 A also has coated object label, but exceed the setting of N value distance threshold, will be judged as and the incoherent content of important interior literary composition, and it is deleted, in under will explain orally in detail).

After a relevance value has been set, each target block group 51, 52, 53, 54, 55 is checked in sequence for a target tag (a <p> tag or a <br> tag). If the target block group being checked does not contain a target tag, the relevance value is incremented by 1 and the next target block group is checked; if it does contain a target tag, the relevance value is reset to 0 and the next target block group is checked. In the example of Fig. 7A, the first target block group 51 contains a target tag <p>, so the relevance value is set to 0. The second target block group 52 is checked next and found to contain no target tag, so the relevance value is incremented to 1. The third target block group 53 is then checked and found to contain no target tag, so the relevance value is incremented again to 2. The fourth target block group 54 is then checked and likewise found to contain no target tag, so the relevance value is incremented again and reaches 3, the set N value. Checking therefore stops before the next target block group 55: the target block groups from this point on are judged to lie too far from the valid target block group 51 that contains target tag information and to belong to unimportant content, so all target block groups below the current target block group 54 (that is, target block group 55) are deleted. In step S-64, three target block groups are then deleted starting from the current target block group 54 and counting upward (because N is set to 3), namely target block groups 52, 53 and 54. At this point the target block groups 52, 53, 54 and 55, judged to be unimportant content, have all been deleted (shown with dotted lines in the figure), and only target block group 51, judged to hold the important body text, is retained. Step S7 then outputs the content of the remaining target block group 51 as the required file. In the example of Fig. 7A, the content finally output is the content (text) carried by the target tag <p> in target block group 51; if this target block group 51 also contains an <img> picture tag, an <a href> hyperlink tag or other tag content, these are all output to the required file as well (that is, the output comprises the plain-text content of the important body text together with other useful information such as pictures and hyperlinks; the final output is shown in Fig. 7B). The object of accurately extracting the required content from an HTML file is thereby achieved.
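The output of step S7 could, for example, be assembled as in the sketch below, which uses Python's standard `html.parser` module to collect the text, picture sources and hyperlink targets of a retained block group. The patent does not define the format of the "required file"; the dictionary layout, the class name `BlockContent` and the function name `export_block` are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class BlockContent(HTMLParser):
    """Collect text, <img> sources and <a> targets from one block group."""

    def __init__(self):
        super().__init__()
        self.text, self.images, self.links = [], [], []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])
        elif tag == "a" and attrs.get("href"):
            self.links.append(attrs["href"])

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


def export_block(block_html):
    """Return the text plus picture and hyperlink information of a block group."""
    parser = BlockContent()
    parser.feed(block_html)
    parser.close()
    return {
        "text": " ".join(parser.text),
        "images": parser.images,
        "links": parser.links,
    }


block = '<div><p>Hello <a href="http://example.com/more">world</a></p><img src="pic.png"></div>'
result = export_block(block)
print(result)
```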

Claims

1. A method for capturing the content of a HyperText Markup Language file, which first obtains an HTML file and performs a preliminary tag-processing procedure to capture an HTML file paragraph related to the main content, the HTML file paragraph containing at least one target tag and the content enclosed by the target tag, characterized in that the method carries out the following steps on the HTML file paragraph:

B. finding the character positions of the first target tag and the last target tag in the HTML file paragraph according to the information recorded in the data structure;

D. setting a relevance value, comparing the target block groups with the target tag for relevance in sequence and recording the relevance value, and deleting the target block groups whose relevance value reaches a set condition; and

E. outputting the content of the remaining target block groups as the required file.

2. The method for capturing the content of a HyperText Markup Language file as claimed in claim 1, characterized in that the target tag comprises the <p> tag and the <br> tag.

3. The method for capturing the content of a HyperText Markup Language file as claimed in claim 1, characterized in that the paragraph-separation step in step C comprises:

C-1. searching upward from the first starting tag for an unpaired opening tag and setting it as a first target block front border;

C-2. searching downward from the first starting tag for an unpaired closing tag and setting it as a first target block rear border;

C-3. merging the content enclosed between the first target block front border and the first target block rear border into a first target block group;

C-4. if the content of the first target block group already encloses the last target tag, proceeding to step D, and if it does not yet enclose the last target tag, proceeding to the next step;

C-5. if the content of the first target block group already encloses the HTML file title, proceeding to step C-11, and if it does not yet enclose the HTML file title, proceeding to the next step;

C-6. setting the first target block front border as a second starting tag, searching upward from the second starting tag for an unpaired opening tag, and setting it as a second target block front border;

C-7. setting the first target block rear border as a third starting tag, searching downward from the third starting tag for an unpaired closing tag, and setting it as a second target block rear border;

C-8. merging the content enclosed between the second target block front border and the second target block rear border into a second target block group;

C-9. if the content of the second target block group does not enclose the HTML file title, deleting the setting of the first target block group, setting the second target block group as the new first target block group, and returning to step C-4;

C-10. if the content of the second target block group encloses the HTML file title, deleting the setting of the second target block group, retaining the content of the first target block group, and returning to step C-4;

C-11. setting the next opening tag after the rear border of the current target block as a fourth tag, treating it as the front border of another new target block, searching downward from the fourth tag for the corresponding closing tag, and setting it as the rear border of the new target block;

C-12. merging the content enclosed between the front border and the rear border of the new target block into another new target block group; and

C-13. if the content of the new target block group does not enclose the last target tag, returning to step C-11, thereby separating out several target block groups, until the last target tag in the HTML file is enclosed by one of the target block groups.

4. The method for capturing the content of a HyperText Markup Language file as claimed in claim 1, characterized in that step D further comprises:

D-1. setting an N value;

D-2. checking in sequence whether each target block group contains the target tag; if a target block group does not contain the target tag, incrementing the relevance value by 1 and continuing with the next target block group; if it does contain the target tag, resetting the relevance value to 0 and continuing with the next target block group;

D-3. if the relevance value at the current target block group equals the N value, stopping the checking of further target block groups and deleting all target block groups below the current target block group; and

D-4. deleting N target block groups, starting from the current target block group and counting upward.

5. The method for capturing the content of a HyperText Markup Language file as claimed in claim 4, characterized in that the N value is 3.

6. The method for capturing the content of a HyperText Markup Language file as claimed in claim 1, characterized in that the information stored in the data structure comprises the information of each target tag and the character position at which that target tag appears.