Embodiment
The invention provides a method and a device for acquiring Internet subject information. The method does not depend on a uniform web page template; it offers a general-purpose way to accurately analyze and process all web pages on the Internet in order to obtain subject information.
Refer to Fig. 1, which is a schematic flowchart of embodiment one of the method for acquiring Internet subject information provided by the embodiments of the invention.
The method for acquiring Internet subject information provided by this embodiment comprises:
Step 100: obtain the HyperText Markup Language (HTML) source code of an Internet web page;
It should be noted that HTML is the abbreviation of HyperText Markup Language and is generally used to write web pages. By viewing the HTML source code of a web page, one can learn the structure of the page and the specific addresses of the pictures or videos it contains.
Step 101: using the div tag as the marker tag, divide the HTML source code into different strings, and form a string list from those strings;
It should be noted that an HTML tag is usually a full English word (for example, blockquote for a block quotation) or an abbreviation (for example, "p" for paragraph). Tags are distinguished from ordinary text because they are enclosed in angle brackets: the paragraph tag is <p>, and the block quotation tag is <blockquote>. Some HTML tags specify how the page is formatted (for example, starting a new paragraph), others specify how words are displayed (<b> makes text bold), and still others provide information that is not displayed on the page, such as the title.
HTML tags generally appear in pairs. Whenever an opening tag such as <blockquote> is used, it must be closed with a matching tag, </blockquote>. The slash before "blockquote" is what distinguishes the closing tag from the opening tag. There are exceptions, however: the <input> tag, for example, needs no closing tag.
HTML source code usually begins with a DOCTYPE declaration, which states the document type. Nothing (including line breaks and spaces) may precede it, otherwise the document declaration becomes invalid. The declaration is followed by the <html> tag, and the document ends with the </html> tag. The <html> and </html> tags are themselves HTML tags, and the page between them has two parts, the head and the body. The title sits in the head, between the <head> and </head> tags, and its text is shown on the minimized window at the bottom of the screen when the page is open. The body lies between the <body> and </body> tags and holds all of the page content: everything displayed on the page is contained between these two tags.
The div tag is one kind of HTML tag. It is a block-level element that provides structure and background for large blocks of content in the HTML source code. The div tag consists of a start tag <div> and an end tag </div>, and everything between the two forms the block. The characteristics of the contained elements are controlled by the attributes of the div tag or by formatting the block with a style sheet.
The div tag is also called a division tag. Its purpose is to set the placement of text, pictures, tables, and so on. When text, images, or other content is placed inside a div tag, the result can be called a "DIV block", a "DIV element", a "CSS layer", or simply a "layer".
Because the HTML source code of web pages built from any template contains div tags, dividing the HTML source code into strings by div tag does not require knowing which type of template the page uses, so the approach is general;
For example, the following HTML simulates the structure of a news website. Each div tag groups the headline and summary of one news item.
<body>
<h1>NEWS WEBSITE</h1>
<p>some text. some text. some text...</p>
...
<div class="news">
<h2>News headline 1</h2>
<p>some text. some text. some text...</p>
...
</div>
<div class="news">
<h2>News headline 2</h2>
<p>some text. some text. some text...</p>
...
</div>
...
</body>
In this embodiment, the div tag is the marker tag: taking <div> and </div> as boundaries, the string contained in each <div> ... </div> group is extracted separately. For example, the string between the first <div> and </div> pair in the HTML source code above is extracted as the first string, that is:
The first string is:
<div class="news">
<h2>News headline 1</h2>
<p>some text. some text. some text...</p>
...
</div>
Then the string between the second <div> and </div> pair in the HTML source code above is extracted as the second string, that is:
The second string is:
<div class="news">
<h2>News headline 2</h2>
<p>some text. some text. some text...</p>
...
</div>
By analogy, the strings between every <div> and </div> pair are extracted in turn to form the string list.
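The division described in Step 101 can be sketched in Python as follows. This is only an illustrative sketch, not the implementation of the invention: the function name split_by_div and the use of a regular expression are assumptions, and the pattern assumes that div blocks are not nested inside one another, a case the description does not address.

```python
import re
from typing import List

# Hypothetical helper; the description names neither a function nor a parsing library.
# Assumption: div blocks are not nested, so a non-greedy match up to the
# first </div> captures exactly one block.
DIV_BLOCK = re.compile(r"<div\b[^>]*>.*?</div\s*>", re.IGNORECASE | re.DOTALL)

def split_by_div(html_source: str) -> List[str]:
    """Return every <div>...</div> block, in document order, as the string list."""
    return DIV_BLOCK.findall(html_source)

sample = """<body>
<h1>NEWS WEBSITE</h1>
<p>some text. some text. some text...</p>
<div class="news">
<h2>News headline 1</h2>
<p>some text. some text. some text...</p>
</div>
<div class="news">
<h2>News headline 2</h2>
<p>some text. some text. some text...</p>
</div>
</body>"""

string_list = split_by_div(sample)
for index, block in enumerate(string_list, start=1):
    print("string", index, "has", len(block), "characters")
```

Text that lies outside every div block, such as the h1 heading in the sample, does not enter the string list, which matches the example above. Pages with nested div blocks would need a depth-tracking parser instead of this pattern.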
Step 102: analyze the strings in the string list one by one to identify the subject information;
Specifically, each string in the string list is analyzed in turn. When, in a given string, the number of characters outside the HTML tags is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is also greater than a preset radix (a threshold value), the content of that string is taken as subject information.
Thus, for each string produced by the div division, the number of characters outside the HTML tags is compared with the number of characters inside them. If the count outside the tags exceeds the count inside the tags and also exceeds the preset radix, the content of that string is judged to be subject information.
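The comparison in Step 102 might be sketched as below. The names count_chars and is_subject are hypothetical, and two counting rules are assumptions that the description leaves open: characters "inside" a tag are taken to be the tag markup including its angle brackets, and whitespace outside the tags is not counted.

```python
import re

TAG = re.compile(r"<[^>]*>")

def count_chars(block):
    """Return (outside, inside) character counts for one string of the list.

    Assumptions: 'inside' counts the tag markup including the angle brackets;
    'outside' counts the remaining characters with whitespace ignored.
    """
    inside = sum(len(tag) for tag in TAG.findall(block))
    outside = len("".join(TAG.sub("", block).split()))
    return outside, inside

def is_subject(block, radix=50):
    """Apply both conditions of Step 102 to one string."""
    outside, inside = count_chars(block)
    return outside > inside and outside > radix

article = '<div class="news"><p>' + "word " * 30 + "</p></div>"
navigation = '<div><a href="/a">Home</a><a href="/b">News</a></div>'

print(is_subject(article))     # text dominates the markup
print(is_subject(navigation))  # markup dominates the text
```

A text-heavy block passes both conditions, while a link-heavy navigation block fails the first one because its markup outweighs its visible text.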
By implementing the invention, a method for acquiring Internet subject information is provided that does not depend on a uniform web page template. It is a general-purpose method: the HTML source code is divided into different strings by div tag and each string is analyzed, so that all web pages on the Internet can be accurately analyzed and processed to obtain subject information.
Refer to Fig. 2, which is a schematic flowchart of embodiment two of the method for acquiring Internet subject information provided by the invention.
It should first be noted that the method provided by the embodiments of the invention can be used to collect news subject information and can also be used to collect log subject information. Depending on whether the subject information to be collected is news information or log subject information, the radix against which the number of characters outside the HTML tags is compared can be set to a different value.
Step 200: download an Extensible Markup Language (XML) page and extract list information;
Specifically, if news subject information is to be collected, the XML page is downloaded and news list information is extracted from it; if log subject information is to be collected, log list information is extracted from the downloaded XML page;
Step 201: download the uniform resource locators (URLs) in the list information in order to obtain the HTML source code of the web pages that contain the subject information.
Specifically, the HTML source code of the page containing the news subject information, or of the web page containing the log subject information, can be obtained.
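Extracting the list information of Steps 200 and 201 could look like the following sketch. The description does not specify the layout of the XML page, so the element names list, item, title, and link, the sample URLs, and the function name extract_urls are all assumptions made for illustration. The sketch parses an in-memory sample and performs no download.

```python
import xml.etree.ElementTree as ET

# Hypothetical list page; the element names and URLs are assumptions.
SAMPLE_XML = """<list>
  <item><title>News headline 1</title><link>http://example.com/news/1.html</link></item>
  <item><title>News headline 2</title><link>http://example.com/news/2.html</link></item>
</list>"""

def extract_urls(xml_text):
    """Return the text of every <link> element, in document order."""
    root = ET.fromstring(xml_text)
    return [link.text.strip() for link in root.iter("link") if link.text]

urls = extract_urls(SAMPLE_XML)
for url in urls:
    print(url)
```

Each extracted URL would then be fetched to obtain the HTML source code of the corresponding page, for example with urllib.request.urlopen from the Python standard library.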
Step 202: filter out the HTML tags in the HTML source code that are irrelevant to the subject information.
Specifically, the HTML tags that are irrelevant to the news subject information or the log subject information, for example script, style, object, iframe, and form tags, are filtered out of the HTML source code;
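The filtering of Step 202 can be illustrated with the sketch below. The function name strip_noise is hypothetical, and the sketch assumes that each listed tag is removed together with everything between its start tag and end tag; the description states only that the tags are filtered out.

```python
import re

# Tag names taken from the description of Step 202.
NOISE_TAGS = ("script", "style", "object", "iframe", "form")

def strip_noise(html_source):
    """Remove each listed element, including its content (an assumption)."""
    for tag in NOISE_TAGS:
        pattern = re.compile(
            r"<%s\b[^>]*>.*?</%s\s*>" % (tag, tag),
            re.IGNORECASE | re.DOTALL,
        )
        html_source = pattern.sub("", html_source)
    return html_source

page = ("<div><script>var a = 1;</script><p>Body text</p>"
        "<style>p{color:red}</style></div>")
cleaned = strip_noise(page)
print(cleaned)
```

Removing these elements before the div division keeps script and style text from being counted as characters outside the tags in the later analysis.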
Step 203: obtain the HTML source code of the Internet web page;
In this embodiment, because the HTML tags irrelevant to the news subject information or log subject information have already been filtered out of this HTML source code, string analysis is more efficient than in the previous embodiment, which lays a foundation for improving the accuracy of the collected subject information.
Step 204: using the div tag as the marker tag, divide the HTML source code into different strings, and form a string list from those strings.
The division is carried out exactly as in Step 101 of embodiment one. In the simulated news-website HTML given there, the string between each <div> and </div> pair is extracted in turn as the first string, the second string, and so on, and these strings form the string list.
Step 205: analyze each string in the string list one by one. When, in a given string, the number of characters outside the HTML tags is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is also greater than the preset radix, the content of that string is taken as subject information.
It should be noted that if the subject information to be collected is news subject information, the radix is set to 50; a string with fewer characters than this is generally not news subject information;
To further improve the accuracy of the collected subject information beyond embodiment one, embodiment two also comprises:
Step 206: obtain the string in the string list that has the largest number of characters outside the HTML tags;
Step 207: analyze the preceding string and the following string of that string in the string list;
Specifically, if the preceding string and/or the following string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is also greater than the preset radix, the content of that preceding string and/or following string is taken as subject information.
Step 208: from the analysis of the preceding string and/or the following string, obtain a result string;
Specifically, if the preceding string and/or the following string satisfies the condition above, that preceding string and/or following string is taken, together with the string that has the largest number of characters outside the HTML tags, as the result string;
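Steps 206 to 208 can be sketched as follows. The names count_chars, qualifies, and build_result are hypothetical. The sketch assumes that "preceding" and "following" mean the immediate neighbours in the string list and that the result string is formed by concatenating the qualifying strings in list order; the counting rules (tag markup counted with its angle brackets, whitespace outside the tags ignored) are likewise assumptions.

```python
import re

TAG = re.compile(r"<[^>]*>")

def count_chars(block):
    """Return (outside, inside) character counts; whitespace outside tags is ignored."""
    inside = sum(len(tag) for tag in TAG.findall(block))
    outside = len("".join(TAG.sub("", block).split()))
    return outside, inside

def qualifies(block, radix):
    outside, inside = count_chars(block)
    return outside > inside and outside > radix

def build_result(string_list, radix=50):
    """Join the largest-text string with its qualifying immediate neighbours."""
    if not string_list:
        return ""
    # Step 206: index of the string with the most characters outside the tags.
    best = max(range(len(string_list)),
               key=lambda i: count_chars(string_list[i])[0])
    parts = []
    # Steps 207 and 208: test the neighbour before and the neighbour after.
    if best > 0 and qualifies(string_list[best - 1], radix):
        parts.append(string_list[best - 1])
    parts.append(string_list[best])
    if best + 1 < len(string_list) and qualifies(string_list[best + 1], radix):
        parts.append(string_list[best + 1])
    return "".join(parts)

nav = '<div><a href="/">Home</a></div>'
lead = "<div><p>" + "a" * 60 + "</p></div>"
main = "<div><p>" + "b" * 200 + "</p></div>"
footer = "<div><p>end</p></div>"

result = build_result([nav, lead, main, footer])
print(len(result))
```

In this sample the string before the largest one qualifies and is joined to it, while the short footer after it is left out.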
Step 209: process the result string to collect the subject information.
Finally, Step 210: save the result string processed in Step 209 together with the subject information it contains, for use in secondary development.
By implementing the invention, a method for acquiring Internet subject information is provided that does not depend on a uniform web page template and offers a general-purpose approach. The HTML source code is first divided into different strings by div tag and each string is analyzed, so that all web pages on the Internet can be accurately analyzed and processed; the analyzed strings then undergo a second analysis, which further improves the accuracy of web page analysis, so that subject information is collected quickly and accurately.
Refer to Fig. 3, which is a schematic flowchart of embodiment three of the method for acquiring Internet subject information of the invention.
This embodiment describes Step 209 of embodiment two in detail. It specifically comprises:
Step 300: compare the characters outside each HTML tag in the result string with filter keywords, and filter out the characters that are irrelevant to the subject information to be collected;
The filter keywords are predetermined. They are keywords for noise information irrelevant to the subject information, such as illegal keywords, advertisement keywords, navigation bar keywords, and survey keywords;
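One possible reading of Step 300 is sketched below. The description does not state how a match is handled, so the sketch assumes that a run of text between two tags is dropped when it contains any filter keyword. The keyword list and the function name filter_noise are hypothetical examples, not values from the description.

```python
import re

# Splitting on a capturing group keeps the tags in the resulting list.
TAG_SPLIT = re.compile(r"(<[^>]*>)")

# Hypothetical keywords, chosen only to illustrate the noise categories.
FILTER_KEYWORDS = ("Advertisement", "Site navigation", "Take our survey")

def filter_noise(result_string, keywords=FILTER_KEYWORDS):
    """Drop each text run outside the tags that contains a filter keyword."""
    kept = []
    for piece in TAG_SPLIT.split(result_string):
        is_tag = piece.startswith("<")
        if not is_tag and any(keyword in piece for keyword in keywords):
            continue  # noise text: leave it out of the result
        kept.append(piece)
    return "".join(kept)

raw = "<div><p>Real news text.</p><p>Advertisement: buy now</p></div>"
filtered = filter_noise(raw)
print(filtered)
```

The tags themselves are left in place at this stage; the later tag-processing step decides which tags to keep.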
Step 301: extract all picture (image) tags from the filtered result string, download the picture resources, and save them; the width and height of each picture can be obtained at the same time;
Step 302: replace the network resource paths in the result string with local resource paths;
Step 303: keep the paragraph (p) tags and picture (image) tags in the result string, and delete the other tags in the result string.
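Steps 301 to 303 might be sketched as follows. The sketch assumes that the picture tag corresponds to the HTML img element and that a local path is formed from a directory name plus the file name of the URL; the directory name images and the function names are hypothetical. Downloading the pictures and reading their width and height are omitted.

```python
import re

IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE)
KEEP = re.compile(r"</?(?:p|img)\b[^>]*>", re.IGNORECASE)
ANY_TAG = re.compile(r"<[^>]*>")

def image_urls(result_string):
    """Step 301 (in part): list the src address of every img tag."""
    return IMG_SRC.findall(result_string)

def localize_paths(result_string, local_dir="images"):
    """Step 302: replace each network path with a local path (assumed layout)."""
    for url in image_urls(result_string):
        local_path = local_dir + "/" + url.rsplit("/", 1)[-1]
        result_string = result_string.replace(url, local_path)
    return result_string

def keep_p_and_img(result_string):
    """Step 303: keep p and img tags, delete every other tag, keep all text."""
    return ANY_TAG.sub(
        lambda m: m.group(0) if KEEP.fullmatch(m.group(0)) else "",
        result_string,
    )

raw = ('<div class="news"><h2>Title</h2><p>Text <b>bold</b></p>'
       '<img src="http://example.com/pics/a.jpg" width="10"></div>')
final = keep_p_and_img(localize_paths(raw))
print(final)
```

Only the tags are deleted in the last step, so the text of removed elements, such as the headline inside h2, remains in the output.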
Finally, the result string processed in Steps 300 to 303 is saved together with the subject information it contains, for use in secondary development.
By implementing the invention, a method for acquiring Internet subject information is provided that, building on the fast and accurate collection of subject information in embodiment one and embodiment two, further cleans the collected subject information. It preserves the original format of the news item or log and can also keep the pictures of the original web page, so the result is better suited to secondary development.
Refer to Fig. 4, which is a schematic structural diagram of embodiment one of the device for acquiring Internet subject information of the invention.
The device for acquiring Internet subject information of this embodiment comprises a source code acquisition module 10, a string forming module 11, and a first string analysis module 12, whose functions are as follows:
The source code acquisition module 10 is used to obtain the HTML source code of an Internet web page;
In a specific implementation, the source code acquisition module 10 performs Step 100 of embodiment one of the method for acquiring Internet subject information described above (hereinafter, method embodiment one);
The string forming module 11 is used to divide the HTML source code into different strings using the div tag as the marker tag, and to form a string list from those strings;
The div tag is one kind of HTML tag. It is a block-level element that provides structure and background for large blocks of content in the HTML source code. The div tag consists of a start tag <div> and an end tag </div>, and everything between the two forms the block. The div tag is also called a division tag; its purpose is to set the placement of text, pictures, tables, and so on. In a specific implementation, the string forming module 11 of this embodiment performs Step 101 of method embodiment one above: it divides the HTML source code into different strings by div tag to form the string list. Because the HTML source code of web pages built from any template contains div tags, there is no need to consider which type of template the page uses, so the approach is general;
The division follows the example given for Step 101 of method embodiment one: in the simulated news-website HTML shown there, the string between each <div> and </div> pair is extracted in turn as the first string, the second string, and so on, and these strings form the string list.
The first string analysis module 12 is used to analyze, one by one, each string in the string list formed by the string forming module 11. When, in a given string, the number of characters outside the HTML tags is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is also greater than the preset radix, the content of that string is taken as subject information.
Specifically, for each string produced by the div division, the first string analysis module 12 compares, as in method embodiment one above, the number of characters outside the HTML tags with the number of characters inside them. If the count outside the tags exceeds the count inside the tags and also exceeds the preset radix, the content of that string is judged to be subject information. In a specific implementation, the first string analysis module 12 performs Step 102 of method embodiment one.
By implementing the invention, a device for acquiring Internet subject information is provided that does not depend on a uniform web page template. It offers a general-purpose approach: the HTML source code is divided into different strings by div tag and each string is analyzed, so that all web pages on the Internet can be accurately analyzed and processed to obtain subject information.
Refer to Fig. 5, which is a schematic structural diagram of embodiment two of the device for acquiring Internet subject information provided by the invention.
It should first be noted that the device provided by the embodiments of the invention can be used to collect news subject information and can also be used to collect log subject information.
In addition to the source code acquisition module 10, the string forming module 11, and the first string analysis module 12 of embodiment one of the device for acquiring Internet subject information described above (hereinafter, device embodiment one), the device of this embodiment comprises a radix setting module 13, an information download module 14, an information filtering module 15, a second string analysis module 16, a string processing module 17, and an information acquisition module 18, whose functions are as follows:
The radix setting module 13 is used to set the radix to different values according to whether the subject information to be collected is news subject information or log subject information;
Specifically, depending on whether the subject information to be collected is news information or log subject information, the radix setting module 13 can set to a different value the radix against which the number of characters outside the HTML tags in a string is compared.
The device of embodiment two also comprises:
The information download module 14, which is used to download an Extensible Markup Language (XML) page and extract list information, and to download the uniform resource locators (URLs) in the list information and send them to the source code acquisition module 10 for processing.
Specifically, if news subject information is to be collected, the information download module 14 downloads the XML page and extracts news list information from it; if log subject information is to be collected, the information download module 14 extracts log list information from the downloaded XML page. It then downloads the URLs in the list information;
In a specific implementation, the information download module 14 performs Step 200 and Step 201 of method embodiment two above;
The source code acquisition module 10 then obtains the HTML source code according to the list information and the URLs;
The information filtering module 15 is used to filter out, from the HTML source code obtained by the source code acquisition module 10, the HTML tags that are irrelevant to the subject information.
Specifically, the information filtering module 15 filters out HTML tags irrelevant to the subject information, such as script, style, object, iframe, and form tags;
In a specific implementation, the information filtering module 15 performs Step 202 of method embodiment two above;
After this, the string forming module 11 described above divides the HTML source code into different strings using the div tag as the marker tag and forms a string list from those strings. The first string analysis module 12 described above then analyzes each string in the string list one by one: when, in a given string, the number of characters outside the HTML tags is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is also greater than the preset radix, the content of that string is taken as subject information;
To further improve the accuracy of the collected subject information beyond device embodiment one, the device of embodiment two also comprises:
The second string analysis module 16, which is used to obtain, after the analysis by the first string analysis module 12, the string in the string list that has the largest number of characters outside the HTML tags, and to analyze the preceding string and the following string of that string in the string list. If the preceding string and/or the following string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is also greater than the preset radix, the content of that preceding string and/or following string is taken as subject information. In a specific implementation, the second string analysis module 16 performs Steps 206 and 207 of method embodiment two above;
The device of embodiment two also comprises:
The string processing module 17, which is used, when the preceding string and/or the following string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is also greater than the preset radix, to take that preceding string and/or following string, together with the string that has the largest number of characters outside the HTML tags, as the result string, and to process the result string in order to collect the subject information. In a specific implementation, the string processing module 17 performs Steps 208 and 209 of method embodiment two above;
The device of embodiment two also comprises:
The information acquisition module 18, which is used to save the result string processed by the string processing module 17 together with the subject information that the result string contains, for use in the user's secondary development.
By implementing the invention, a device for acquiring Internet subject information is provided that does not depend on a uniform web page template and offers a general-purpose approach. The HTML source code is first divided into different strings by div tag and each string is analyzed, so that all web pages on the Internet can be accurately analyzed and processed; the analyzed strings then undergo a second analysis, which further improves the accuracy of web page analysis, so that subject information is collected quickly and accurately.
Refer to Fig. 6, which is a schematic structural diagram of embodiment three of the device for acquiring Internet subject information of the invention.
This embodiment describes in detail the string processing module 17 of device embodiment two above;
The string processing module 17 specifically comprises a character filtering unit 170, a picture download unit 171, a path replacement unit 172, and a tag processing unit 173, whose functions are as follows:
The character filtering unit 170 is used to compare the characters outside each HTML tag in the result string with filter keywords and to filter out the characters irrelevant to the subject information. Specifically, the filter keywords are predetermined; they are keywords for noise information irrelevant to the subject information, such as illegal keywords, advertisement keywords, navigation bar keywords, and survey keywords. In a specific implementation, the character filtering unit 170 performs Step 300 of method embodiment three above;
The picture download unit 171 is used to extract all picture (image) tags from the result string after it has been filtered by the character filtering unit 170, and to download and save the picture resources; the width and height of each picture can be obtained at the same time;
The path replacement unit 172 is used to replace the network resource paths in the result string with local resource paths;
The tag processing unit 173 is used to keep the paragraph (p) tags and picture (image) tags in the result string and to delete the other tags in the result string.
By implementing the invention, a device for acquiring Internet subject information is provided that, building on the fast and accurate collection of subject information in device embodiment one and device embodiment two, further cleans the collected subject information. It preserves the original format of the news item or log and can also keep the pictures of the original web page, so the result is better suited to secondary development.
A person of ordinary skill in the art will understand that all or part of the flows of the method embodiments above can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the method embodiments described above. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above is a preferred implementation of the invention. It should be pointed out that a person of ordinary skill in the art can make improvements and modifications without departing from the principle of the invention, and such improvements and modifications are also regarded as falling within the scope of protection of the invention.