CN101702160A - Method for acquiring internet subject information and device thereof - Google Patents

Method for acquiring internet subject information and device thereof

Info

Publication number
CN101702160A
Authority
CN
China
Prior art keywords
character string
character
subject information
html
html tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN200910110356A
Other languages
Chinese (zh)
Other versions
CN101702160B (en)
Inventor
黎柯
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Longguan Media Co., Ltd.
Original Assignee
Shenzhen Coship Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Coship Electronics Co Ltd filed Critical Shenzhen Coship Electronics Co Ltd
Priority to CN 200910110356 priority Critical patent/CN101702160B/en
Publication of CN101702160A publication Critical patent/CN101702160A/en
Application granted granted Critical
Publication of CN101702160B publication Critical patent/CN101702160B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention provides a method and a device for acquiring internet subject information. The method comprises: acquiring the HyperText Markup Language (HTML) source code of an internet web page; dividing the HTML source code into different character strings using the div tag as the marker tag, and forming the character strings into a character string list; and analyzing each character string in the list one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags and is also greater than a set base number, taking the content of that character string as the subject information. By dividing the HTML source code into several character strings at the div tags and analyzing those strings to obtain the subject information, the method and device can process web pages built from different web page templates on the internet and improve the accuracy of subject information acquisition.

Description

Method and device for acquiring internet subject information
Technical field
The present invention relates to techniques for processing internet information, and in particular to a method and device for acquiring internet subject information.
Background art
Anyone browsing web pages on the Web will find that they usually contain two kinds of content. One part carries the subject of the page, such as the news text of a news page; we call this "subject" information. The other part consists of content unrelated to the subject, such as navigation bars, advertisements, copyright notices and questionnaires; we call this "noise" information. Noise information is usually distributed around the subject information and is sometimes mixed into the subject content, but it has no content relevance to it.
Noise information usually appears in the form of link navigation text (anchor text), so noise often causes pages that link to one another to have no content relevance either. Noise content in web pages therefore creates difficulties both for Web applications that rely on page content and for applications that rely on the hyperlinks between pages.
Once the noise content of a web page has been identified and removed quickly and accurately, the subject content of the page can be collected for subsequent processing or development.
Prior art one proposes a method for removing noise information from internet web pages and collecting subject information. The method first builds a tag tree of the page based on <table> tags, and then, again based on <table> tags, divides each page into mutually nested content blocks. Then, for a set of pages made with the same template, content that appears repeatedly across the set is treated as redundant content, and content blocks that rarely appear in common across the set are treated as the effective information blocks. Experiments show that the method is effective, but it is confined to sets of pages based on the same template, and since the web page templates on the Web are countless, the method is clearly not general enough.
HTML (HyperText Markup Language) is a markup language that defines a set of tags describing the page layout used when a web page is displayed. The most common structural representation of an HTML page is therefore a tag tree. Many tools exist for building tag trees; DOM (Document Object Model) is a commonly used one, which organizes the tags of a page into a tree structure according to their nesting. To remove noise and collect the useful subject information, a DOM tree is first generated from the HTML code, and the tree elements are then parsed to extract the subject information.
The Document Object Model (DOM) represents a document as a tree structure according to the nesting of the markup in the document; the elements, attributes, parsed character data, comments, processing instructions and so on in the document are all nodes.
Prior art two is implemented in the following steps:
1. An HTML document that is not sufficiently standard is tidied into a well-formed XHTML document;
2. The XHTML document is parsed into a tree model, the DOM tree;
3. Information extraction is then carried out around the DOM tree;
4. By inductively learning the structure of sample web pages provided by the user, an XML document can be generated from the nodes of the DOM that retains only the nodes holding the information the user is interested in, which completes the information extraction.
In the course of implementing the present invention, the inventor found that prior art two has at least the following shortcomings:
The DOM tree is relatively complex, so analysis is inefficient and slow; and DOM trees vary widely from page to page, which makes obtaining the correct subject information difficult.
Summary of the invention
The technical problem to be solved by the present invention is, in view of the above deficiencies of the prior art, to provide a method and device for acquiring internet subject information that are not tied to a single web page template and that offer a general approach for accurately analyzing and processing all web pages on the internet to obtain subject information.
A method for acquiring internet subject information provided by an embodiment of the invention comprises:
Acquiring the HyperText Markup Language (HTML) source code of an internet web page;
Dividing the HTML source code into different character strings using the div tag as the marker tag, and forming the different character strings into a character string list;
Analyzing each character string in the character string list one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, taking the content of that character string as the subject information.
An embodiment of the invention also provides a device for acquiring internet subject information, comprising:
A source code acquisition module, configured to acquire the HTML source code of an internet web page;
A character string forming module, configured to divide the HTML source code into different character strings using the div tag as the marker tag, and to form the different character strings into a character string list;
A first character string analysis module, configured to analyze each character string in the character string list one by one, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base number, to take the content of that character string as the subject information.
By dividing the HTML source code into several character strings at the div tags and then analyzing those character strings to obtain the subject information, the method and device provided by the invention can process web pages built from different web page templates on the internet and improve the accuracy of subject information acquisition.
Description of drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below obviously show only some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of embodiment one of the method for acquiring internet subject information according to the invention;
Fig. 2 is a schematic flowchart of embodiment two of the method for acquiring internet subject information according to the invention;
Fig. 3 is a schematic flowchart of embodiment three of the method for acquiring internet subject information according to the invention;
Fig. 4 is a schematic structural diagram of embodiment one of the device for acquiring internet subject information according to the invention;
Fig. 5 is a schematic structural diagram of embodiment two of the device for acquiring internet subject information according to the invention;
Fig. 6 is a schematic structural diagram of embodiment three of the device for acquiring internet subject information according to the invention.
Embodiment
The invention provides a method and device for acquiring internet subject information that are not tied to a single web page template and that offer a general approach for accurately analyzing and processing all web pages on the internet to obtain subject information.
Refer to Fig. 1, a schematic flowchart of embodiment one of the method for acquiring internet subject information provided by an embodiment of the invention.
The method for acquiring internet subject information provided by the embodiment of the invention comprises:
Step 100: acquire the HyperText Markup Language (HTML) source code of an internet web page;
It should be noted that HTML is the abbreviation of HyperText Markup Language and is generally used for writing web pages. By viewing the HTML source code of a web page on the network, one can learn the structure of the page and the specific addresses of some of its pictures or videos.
Step 101: using the div tag as the marker tag, divide the HTML source code into different character strings, and form the different character strings into a character string list (drafting note: based on the examiner's opinion notices our department has received in the past, it would be best to consult the inventor again here and give the concrete written format of the character strings);
It should be noted that an HTML tag is usually the full name of an English word (for example, block quotation: blockquote) or an abbreviation (for example, "p" for paragraph), but tags differ from ordinary text because they are enclosed in angle brackets. Thus the paragraph tag is <p> and the block quotation tag is <blockquote>. Some HTML tags indicate how the page is to be formatted (for example, starting a new paragraph), others indicate how words are to be displayed (<b> makes text bold), and still others provide information that is not displayed on the page, such as the title.
HTML tags appear in pairs. Whenever a tag such as <blockquote> is used, it must be closed with another tag, </blockquote>. The slash before "blockquote" is what distinguishes the closing tag from the opening tag. Some tags are exceptions, however; the <input> tag, for example, needs no closing tag.
Usually, HTML source code begins with a DOCTYPE, which declares the type of the document; nothing (including line breaks and spaces) may precede it, otherwise the document declaration becomes invalid. Next comes the <html> tag, and the document ends with the </html> tag. The <html> and </html> tags are themselves HTML tags, and between them the page has two parts, the title and the body. The title is placed between the <head> and </head> tags; these are the words that appear on the minimized window at the bottom of the screen while the page is open. The body is placed between the <body> and </body> tags and holds all the content of the page: everything displayed on the page is contained between these two tags.
The div tag is one kind of HTML tag; it is a block-level element used to give structure and context to large blocks of content in the HTML source code. It consists of a start tag <div> and an end tag </div>, and everything between these two tags forms the block. The characteristics of the contained elements are controlled by the attributes of the div tag, or by formatting the block with a style sheet.
The div tag is also called the division tag. Its role is to set the placement of text, pictures, tables and so on. When text, images or other content are placed inside a div tag, the result may be called a "DIV block", a "DIV element", a "CSS layer", or simply a "layer".
Because div tags are present in the HTML source code of web pages built from any template, dividing the HTML source code into character strings at div tags does not require knowing which type of template the page uses, and the approach is therefore general;
For example, the following HTML simulates the structure of a news website. Each div tag groups the headline and summary of one news item.
<body>
<h1>NEWS WEBSITE</h1>
<p>some text.some text.some text...</p>
...
<div class="news">
<h2>News headline 1</h2>
<p>some text.some text.some text...</p>
...
</div>
<div class="news">
<h2>News headline 2</h2>
<p>some text.some text.some text...</p>
...
</div>
...
</body>
In this embodiment the div tag is the marker tag; that is, <div> and </div> serve as boundaries, and the character string enclosed by each <div> and </div> pair is extracted separately. For example, the character string between the first <div> and </div> pair in the HTML source code above is extracted as the first character string, namely:
The first character string is:
<div class="news">
<h2>News headline 1</h2>
<p>some text.some text.some text...</p>
...
</div>
Next, the character string between the second <div> and </div> pair in the HTML source code above is extracted as the second character string, namely:
The second character string is:
<div class="news">
<h2>News headline 2</h2>
<p>some text.some text.some text...</p>
...
</div>
In the same way, the character strings between all <div> and </div> pairs are extracted in turn to form the character string list.
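As a non-limiting illustration, the division of the HTML source code at div tags could be sketched in Python as follows. The function name split_by_div and the regular expression are assumptions made for this sketch; the description does not give a concrete splitting routine, and it does not say how nested div tags are to be treated, so the sketch simply pairs each <div> with the nearest following </div>.

```python
import re

# Assumed pattern: one opening <div ...>, the shortest possible content,
# and the nearest closing </div>. Nested divs are NOT balanced here.
DIV_BLOCK = re.compile(r"<div\b[^>]*>.*?</div\s*>", re.IGNORECASE | re.DOTALL)


def split_by_div(html: str) -> list[str]:
    """Return every <div ...>...</div> span in document order (the string list)."""
    return [match.group(0) for match in DIV_BLOCK.finditer(html)]


sample = """<body>
<h1>NEWS WEBSITE</h1>
<p>some text.some text.some text...</p>
<div class="news">
<h2>News headline 1</h2>
<p>some text.some text.some text...</p>
</div>
<div class="news">
<h2>News headline 2</h2>
<p>some text.some text.some text...</p>
</div>
</body>"""

string_list = split_by_div(sample)
for index, block in enumerate(string_list, start=1):
    print(f"character string {index}:")
    print(block)
```

A regular expression is used only to keep the sketch short; a page with nested div tags would need a balanced scan or a tolerant HTML parser instead.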
Step 102: analyze each character string in the character string list one by one to identify the subject information;
Specifically, each character string in the character string list is analyzed one by one, and when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than the set base number, the content of that character string is taken as the subject information.
Thus, for each character string delimited by div tags, the number of characters outside the HTML tags is compared with the number of characters inside the HTML tags; if the number outside the tags is greater than the number inside the tags and is also greater than the predetermined base number, the content of that character string can be judged to be the subject information to be acquired.
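As a non-limiting illustration, the comparison described above could be sketched in Python as follows. The names count_chars and is_subject are hypothetical. Two counting details are assumptions of this sketch, since the description does not settle them: the angle brackets are counted as part of a tag, and whitespace outside the tags is not counted.

```python
import re

# Anything from "<" to the next ">" is treated as one HTML tag (assumption).
TAG = re.compile(r"<[^>]*>")


def count_chars(block: str) -> tuple[int, int]:
    """Return (characters outside tags, characters inside tags) for one string."""
    inside = sum(len(tag) for tag in TAG.findall(block))
    text_only = TAG.sub("", block)
    outside = len("".join(text_only.split()))  # whitespace ignored (assumption)
    return outside, inside


def is_subject(block: str, base: int) -> bool:
    """Apply both conditions: outside > inside, and outside > base number."""
    outside, inside = count_chars(block)
    return outside > inside and outside > base


article = '<div class="news"><p>' + "x" * 100 + "</p></div>"
navigation = '<div><a href="/a">Home</a><a href="/b">News</a></div>'

print(count_chars(article), is_subject(article, base=50))
print(count_chars(navigation), is_subject(navigation, base=50))
```

The navigation block fails the test because its markup outweighs its visible text, which is the behaviour the criterion relies on to separate link-heavy noise from running text.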
The method for acquiring internet subject information provided by the invention is not tied to a single web page template. It offers a general approach: the HTML source code is divided into different character strings at the div tags and each character string is analyzed, so that all web pages on the internet can be accurately analyzed and processed to obtain subject information.
Refer to Fig. 2, a schematic flowchart of embodiment two of the method for acquiring internet subject information provided by the invention.
It should first be noted that the method provided by the embodiments of the invention can be used to collect either news subject information or blog (web log) subject information. Depending on which of the two is to be collected, the base number against which the number of characters outside the HTML tags in a character string is compared can be set to a different value.
Step 200: download an Extensible Markup Language (XML) page and extract list information;
Specifically, if news subject information is to be collected, the XML page is downloaded and news list information is extracted from it; if blog subject information is to be collected, blog list information is extracted from the downloaded XML page;
Step 201: download the Uniform Resource Locators (URLs) in the list information to obtain the HTML source code of the web pages that contain the subject information.
Specifically, the HTML source code of the page containing the news subject information, or of the page containing the blog subject information, can be obtained.
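As a non-limiting illustration, steps 200 and 201 could be sketched in Python as follows. The description does not specify the layout of the XML page, so this sketch assumes an RSS-like list in which each entry is an <item> element with a <link> child; the names extract_urls and fetch are hypothetical.

```python
import urllib.request
import xml.etree.ElementTree as ET


def extract_urls(xml_text: str) -> list[str]:
    """Step 200 (assumed layout): collect the <link> of every <item> element."""
    root = ET.fromstring(xml_text)
    urls = []
    for item in root.iter("item"):
        link = item.findtext("link")
        if link and link.strip():
            urls.append(link.strip())
    return urls


def fetch(url: str, timeout: float = 10.0) -> str:
    """Step 201: download one URL and return its HTML source code as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# Illustrative list page; the URLs are placeholders, not real sources.
feed = """<rss><channel><title>News</title><link>http://example.com/</link>
<item><title>A</title><link>http://example.com/a.html</link></item>
<item><title>B</title><link> http://example.com/b.html </link></item>
</channel></rss>"""

print(extract_urls(feed))
```

Calling fetch on each extracted URL would yield the HTML source code that the later steps operate on; it is left out of the demonstration so that the sketch runs without network access.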
Step 202: filter out of the HTML source code the HTML tags that are irrelevant to the subject information (that is, <html> tags and </html> tags).
Specifically, the HTML tags in the HTML source code that are unrelated to the news subject information or blog subject information, for example script, style, object, iframe and form tags, are filtered out;
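As a non-limiting illustration, the filtering of step 202 could be sketched in Python as follows. The list of tag names comes from the description; the function name strip_noise_tags is hypothetical, and removing each tag together with everything it encloses is an assumption of this sketch, since the description does not say whether the enclosed content is also discarded.

```python
import re

# Tag names listed in the description as irrelevant to the subject information.
NOISE_TAGS = ("script", "style", "object", "iframe", "form")


def strip_noise_tags(html: str) -> str:
    """Remove each listed element, opening tag through closing tag inclusive."""
    for tag in NOISE_TAGS:
        pattern = re.compile(
            rf"<{tag}\b[^>]*>.*?</{tag}\s*>", re.IGNORECASE | re.DOTALL
        )
        html = pattern.sub("", html)
    return html


page = "<div><script>var a = 1;</script><p>Body text</p><style>p{}</style></div>"
print(strip_noise_tags(page))
```

Running the filter before the division into character strings keeps script and style text from being miscounted as characters outside the HTML tags.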
Step 203: acquire the HTML source code of the internet web page;
In this embodiment, because the HTML tags irrelevant to the news or blog subject information have already been filtered out of the HTML source code, character string analysis is more efficient than in the previous embodiment, which lays a good foundation for improving the accuracy of subject information collection.
Step 204: using the div tag as the marker tag, divide the HTML source code into different character strings, and form the different character strings into a character string list.
For example, the following HTML simulates the structure of a news website. Each div tag groups the headline and summary of one news item.
<body>
<h1>NEWS WEBSITE</h1>
<p>some text.some text.some text...</p>
...
<div class="news">
<h2>News headline 1</h2>
<p>some text.some text.some text...</p>
...
</div>
<div class="news">
<h2>News headline 2</h2>
<p>some text.some text.some text...</p>
...
</div>
...
</body>
In this embodiment the div tag is the marker tag; that is, <div> and </div> serve as boundaries, and the character string enclosed by each <div> and </div> pair is extracted separately. For example, the character string between the first <div> and </div> pair in the HTML source code above is extracted as the first character string, namely:
The first character string is:
<div class="news">
<h2>News headline 1</h2>
<p>some text.some text.some text...</p>
...
</div>
Next, the character string between the second <div> and </div> pair in the HTML source code above is extracted as the second character string, namely:
The second character string is:
<div class="news">
<h2>News headline 2</h2>
<p>some text.some text.some text...</p>
...
</div>
In the same way, the character strings between all <div> and </div> pairs are extracted in turn to form the character string list.
Step 205: analyze each character string in the character string list one by one, and when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than the set base number, take the content of that character string as the subject information.
It should be noted that if the subject information to be collected is news subject information, the base number is set to 50; a character string with fewer characters than this is generally not news subject information;
To further improve the accuracy of subject information collection on the basis of embodiment one, embodiment two also comprises:
Step 206: obtain the character string in the character string list that has the largest number of characters outside the HTML tags;
Step 207: analyze the character string preceding and the character string following, in the character string list, the character string that has the largest number of characters outside the HTML tags;
Specifically, if the preceding character string and/or the following character string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is also greater than the set base number, the content of that preceding and/or following character string is taken as subject information.
Step 208: analyze the preceding character string and/or the following character string to obtain the result character string;
Specifically, if the preceding character string and/or the following character string satisfies the condition that the number of characters outside its HTML tags is greater than the number of characters inside its HTML tags and is also greater than the set base number, that preceding and/or following character string, together with the character string that has the largest number of characters outside the HTML tags, is taken as the result character string;
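As a non-limiting illustration, steps 206 to 208 could be sketched in Python as follows. The function names are hypothetical. Joining the qualifying neighbours to the largest character string by plain concatenation in document order is an assumption of this sketch, as are the counting conventions (angle brackets counted as tag characters, whitespace outside tags ignored).

```python
import re

TAG = re.compile(r"<[^>]*>")


def counts(block: str) -> tuple[int, int]:
    """Return (characters outside tags, characters inside tags)."""
    inside = sum(len(tag) for tag in TAG.findall(block))
    outside = len("".join(TAG.sub("", block).split()))
    return outside, inside


def qualifies(block: str, base: int) -> bool:
    outside, inside = counts(block)
    return outside > inside and outside > base


def result_string(string_list: list[str], base: int) -> str:
    """Steps 206-208: take the string with the most text, plus qualifying neighbours."""
    if not string_list:
        return ""
    # Step 206: index of the string with the most characters outside tags.
    best = max(range(len(string_list)), key=lambda i: counts(string_list[i])[0])
    parts = []
    # Steps 207-208: test the preceding and the following string.
    if best > 0 and qualifies(string_list[best - 1], base):
        parts.append(string_list[best - 1])
    parts.append(string_list[best])
    if best + 1 < len(string_list) and qualifies(string_list[best + 1], base):
        parts.append(string_list[best + 1])
    return "".join(parts)


blocks = [
    '<div><a href="/">Home</a></div>',
    "<div><p>" + "a" * 60 + "</p></div>",
    "<div><p>" + "b" * 200 + "</p></div>",
    "<div>" + "c" * 10 + "</div>",
]
print(len(result_string(blocks, base=50)))
```

In this example the third block has the most text, the second block qualifies and is kept, and the fourth block is dropped because it falls below the base number.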
Step 209: process the result character string to collect the subject information.
Finally, step 210: save the subject information contained in the character string processed in step 209, together with the character string itself, for use in secondary development.
The method for acquiring internet subject information provided by the invention is not tied to a single web page template and offers a general approach. The HTML source code is first divided into different character strings at the div tags and each character string is analyzed, so that all web pages on the internet can be accurately analyzed and processed; the analyzed character strings are then analyzed a second time, which further improves the accuracy of web page analysis, so that subject information is collected quickly and accurately.
Refer to Fig. 3, a schematic flowchart of embodiment three of the method for acquiring internet subject information according to the invention.
This embodiment describes step 209 of embodiment two in detail. It specifically comprises:
Step 300: compare the characters outside each HTML tag in the result character string with the filter keywords, and filter out the characters irrelevant to the subject information to be collected;
The filter keywords are predetermined; they are, specifically, illegal keywords, advertisement keywords, navigation bar keywords, survey keywords and other noise information irrelevant to the subject information;
Step 301: extract all picture (image) tags from the filtered result character string, and download and save the picture resources; the width and height of each picture can be obtained at the same time;
Step 302: replace the network resource paths in the result character string with local resource paths;
Step 303: keep the paragraph (p) tags and picture (image) tags in the result character string, and delete all other tags from the result character string.
Finally, save the subject information contained in the character string processed in steps 300 to 303, together with the character string itself, for use in secondary development.
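As a non-limiting illustration, steps 300 to 303 could be sketched in Python as follows. The function name clean_result and the local directory argument are hypothetical. The sketch assumes that the picture tag of the description is the HTML <img> tag, that filtering a keyword means deleting its text outside the tags, and that a local path is formed from the file name of the network path; the actual download of the pictures and the reading of their width and height are omitted.

```python
import posixpath
import re
from urllib.parse import urlparse

TAG_SPLIT = re.compile(r"(<[^>]*>)")
IMG_SRC = re.compile(r"""<img\b[^>]*?\bsrc\s*=\s*["']([^"']+)["']""", re.IGNORECASE)
# Any opening or closing tag whose name is not exactly "p" or "img".
OTHER_TAGS = re.compile(r"</?(?!p\b|img\b)[a-zA-Z][^>]*>", re.IGNORECASE)


def clean_result(result: str, keywords: list[str], local_dir: str):
    """Return (cleaned string, list of picture URLs found in it)."""
    # Step 300: delete filter keywords from the text outside the tags only.
    parts = TAG_SPLIT.split(result)  # even indices hold text, odd indices hold tags
    for i in range(0, len(parts), 2):
        for keyword in keywords:
            parts[i] = parts[i].replace(keyword, "")
    result = "".join(parts)
    # Step 301: collect the picture URLs (downloading is left to the caller).
    image_urls = IMG_SRC.findall(result)
    # Step 302: replace each network path with a local path.
    for url in image_urls:
        file_name = posixpath.basename(urlparse(url).path)
        result = result.replace(url, f"{local_dir}/{file_name}")
    # Step 303: keep <p> and <img>, delete every other tag.
    result = OTHER_TAGS.sub("", result)
    return result, image_urls


source = (
    '<div class="news"><h2>Title</h2><p>Body text.[AD]</p>'
    '<img src="http://example.com/pics/a.jpg"><span>x</span></div>'
)
cleaned, urls = clean_result(source, ["[AD]"], "images")
print(cleaned)
print(urls)
```

Only the tags are deleted in step 303, so the text they enclosed (the headline, for instance) stays in the output as plain text between the retained paragraph and picture tags.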
With the method for acquiring internet subject information provided by the invention, building on the fast and accurate collection of subject information in embodiments one and two, the collected subject information is further cleaned while the original form of the news item or blog entry is retained, and the pictures of the original web page can also be kept, so the result is better suited to secondary development.
Refer to Fig. 4, a schematic structural diagram of embodiment one of the device for acquiring internet subject information according to the invention.
The device for acquiring internet subject information of this embodiment comprises a source code acquisition module 10, a character string forming module 11 and a first character string analysis module 12, whose functions are as follows:
The source code acquisition module 10 is configured to acquire the HTML source code of an internet web page;
In a concrete implementation, the source code acquisition module 10 is used to carry out step 100 of embodiment one of the foregoing method for acquiring internet subject information (hereinafter, method embodiment one);
The character string forming module 11 is configured to divide the HTML source code into different character strings using the div tag as the marker tag, and to form the different character strings into a character string list;
It should be noted that an HTML tag is usually the full name of an English word (for example, block quotation: blockquote) or an abbreviation (for example, "p" for paragraph), but tags differ from ordinary text because they are enclosed in angle brackets. Thus the paragraph tag is <p> and the block quotation tag is <blockquote>. Some HTML tags indicate how the page is to be formatted (for example, starting a new paragraph), others indicate how words are to be displayed (<b> makes text bold), and still others provide information that is not displayed on the page, such as the title.
HTML tags appear in pairs. Whenever a tag such as <blockquote> is used, it must be closed with another tag, </blockquote>. The slash before "blockquote" is what distinguishes the closing tag from the opening tag. Some tags are exceptions, however; the <input> tag, for example, needs no closing tag.
Usually, HTML source code begins with a DOCTYPE, which declares the type of the document; nothing (including line breaks and spaces) may precede it, otherwise the document declaration becomes invalid. Next comes the <html> tag, and the document ends with the </html> tag. The <html> and </html> tags are themselves HTML tags, and between them the page has two parts, the title and the body. The title is placed between the <head> and </head> tags; these are the words that appear on the minimized window at the bottom of the screen while the page is open. The body is placed between the <body> and </body> tags and holds all the content of the page: everything displayed on the page is contained between these two tags.
The div tag is one kind of HTML tag; it is a block-level element used to give structure and context to large blocks of content in the HTML source code. It consists of a start tag <div> and an end tag </div>, and everything between these two tags forms the block. The div tag is also called the division tag; its role is to set the placement of text, pictures, tables and so on. Div tags are present in the HTML source code of web pages built from any template. In a concrete implementation, the character string forming module 11 of this embodiment is used to carry out step 101 of method embodiment one: it divides the HTML source code into different character strings at the div tags, without needing to consider which type of template the page uses, and forms the character string list, and the approach is therefore general;
For example, the following HTML simulates the structure of a news website. Each div tag groups the headline and summary of one news item.
<body>
<h1>NEWS WEBSITE</h1>
<p>some text.some text.some text...</p>
...
<div class="news">
<h2>News headline 1</h2>
<p>some text.some text.some text...</p>
...
</div>
<div class="news">
<h2>News headline 2</h2>
<p>some text.some text.some text...</p>
...
</div>
...
</body>
In this embodiment the div tag is the marker tag; that is, <div> and </div> serve as boundaries, and the character string enclosed by each <div> and </div> pair is extracted separately. For example, the character string between the first <div> and </div> pair in the HTML source code above is extracted as the first character string, namely:
The first character string is:
<div class="news">
<h2>News headline 1</h2>
<p>some text.some text.some text...</p>
...
</div>
Next, the character string between the second <div> and </div> pair in the HTML source code above is extracted as the second character string, namely:
The second character string is:
<div class="news">
<h2>News headline 2</h2>
<p>some text.some text.some text...</p>
...
</div>
In the same way, the character strings between all <div> and </div> pairs are extracted in turn to form the character string list.
The first character string analysis module 12 is configured to analyze, one by one, each character string in the character string list formed by the character string forming module 11, and, when the number of characters outside the HTML tags in a character string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than the set base number, to take the content of that character string as the subject information.
Specifically, for each character string delimited by div tags, the first character string analysis module 12 compares the number of characters outside the HTML tags with the number of characters inside the HTML tags, as in method embodiment one; if the number outside the tags is greater than the number inside the tags and is also greater than the predetermined base number, the content of that character string can be judged to be the subject information to be acquired. In a concrete implementation, the first character string analysis module 12 is used to carry out step 102 of method embodiment one.
The device for acquiring internet subject information provided by the invention is not tied to a single web page template. It offers a general approach: the HTML source code is divided into different character strings at the div tags and each character string is analyzed, so that all web pages on the internet can be accurately analyzed and processed to obtain subject information.
Fig. 5 is a structural diagram of embodiment two of the device for acquiring internet subject information provided by the present invention.
It should first be noted that the device provided by the embodiments of the present invention can be used to collect either news subject information or log subject information.
The device provided by this embodiment comprises the source code acquisition module 10, the string forming module 11 and the first string analysis module 12 of embodiment one of the device for acquiring internet subject information described above (hereinafter, device embodiment one), and further comprises a base value setting module 13, an information download module 14, an information filtering module 15, a second string analysis module 16, a string processing module 17 and an information acquisition module 18. Their functions are as follows:
The base value setting module 13 is configured to set the base value to different values depending on whether the subject information to be collected is news subject information or log subject information.
Specifically, depending on whether the subject information to be collected is news information or log information, the base value setting module 13 sets a different base value against which the number of characters outside the HTML tags in the analyzed string is compared.
The device in embodiment two further comprises:
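A base value setting of this kind might look like the following Python sketch. The numbers 200 and 100 are placeholders chosen only for illustration, since the source does not state concrete base values, and the names BASE_VALUES and base_value_for are hypothetical.

```python
# Placeholder base values: the source does not give concrete numbers.
BASE_VALUES = {"news": 200, "log": 100}


def base_value_for(subject_type):
    """Return the base value for the kind of subject information to collect."""
    if subject_type not in BASE_VALUES:
        raise ValueError("unknown subject type: %r" % (subject_type,))
    return BASE_VALUES[subject_type]
```

Keeping the values in one table makes it straightforward to tune them separately for news pages and log pages.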
The information download module 14 is configured to download an Extensible Markup Language (XML) page and extract list information from it, and to download the Uniform Resource Locators (URLs) in the list information and send them to the source code acquisition module 10 for processing.
Specifically, if news subject information is to be collected, the information download module 14 downloads the XML page and extracts news list information from it; if log subject information is to be collected, the information download module 14 extracts log list information from the downloaded XML page. It then downloads the URLs in the list information.
In a specific embodiment, the information download module 14 performs steps 200 and 201 of method embodiment two above.
The source code acquisition module 10 then obtains the HTML source code according to the list information and the URLs.
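As a rough illustration of extracting URLs from the list information, the Python sketch below parses an already downloaded XML page. The source does not describe the XML layout, so the RSS-like structure with item and link elements, the function name extract_list_urls and the example.com addresses are all assumptions.

```python
import xml.etree.ElementTree as ET


def extract_list_urls(xml_text):
    """Return the URLs found in an RSS-like list page (assumed layout)."""
    root = ET.fromstring(xml_text)
    urls = []
    # Assumption: each list entry is an <item> with a <link> child.
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            urls.append(link.strip())
    return urls


# Hypothetical news list page.
sample_xml = """<rss><channel>
<item><title>News headline 1</title><link>http://example.com/news/1.html</link></item>
<item><title>News headline 2</title><link>http://example.com/news/2.html</link></item>
</channel></rss>"""
urls = extract_list_urls(sample_xml)
```

Each URL in urls could then be fetched, for instance with urllib.request.urlopen from the Python standard library, to obtain the HTML source code of the page that carries the subject information.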
The information filtering module 15 is configured to filter out, from the HTML source code obtained by the source code acquisition module 10, the HTML tags unrelated to subject information.
Specifically, the information filtering module 15 filters out HTML tags unrelated to subject information, such as script, style, object, iframe and form tags.
In a specific implementation, the information filtering module 15 performs step 202 of method embodiment two above.
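The filtering step can be sketched in Python as below. The function name strip_noise_tags is hypothetical, and removing each element together with its content (rather than only the tag markers) is an assumption about what "filtering" means here.

```python
import re

# Tags named in the description as unrelated to subject information.
NOISE_TAGS = ("script", "style", "object", "iframe", "form")

# An opening noise tag, its content, and the matching closing tag.
NOISE_BLOCK = re.compile(
    r"<(%s)\b[^>]*>.*?</\1\s*>" % "|".join(NOISE_TAGS),
    re.IGNORECASE | re.DOTALL,
)


def strip_noise_tags(html_source):
    """Remove noise elements, including their content (assumption)."""
    return NOISE_BLOCK.sub("", html_source)


# Hypothetical input mixing content with script and style elements.
page = (
    '<div><script type="text/javascript">var a = 1;</script>'
    "<p>text</p><style>p{}</style></div>"
)
cleaned = strip_noise_tags(page)
```

After this step cleaned contains only the div and p markup, so script and style text no longer inflates the character counts used in the later analysis.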
The string forming module 11 described above then divides the HTML source code into different strings using the div tag as the marker tag, and forms the different strings into a string list. The first string analysis module 12 described above then analyzes each string in the string list one by one; when the number of characters outside the HTML tags in a string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than the set base value, the content of that string is taken as subject information.
To further improve the accuracy of subject information collection on the basis of device embodiment one, the device in embodiment two further comprises:
The second string analysis module 16 is configured to obtain, after the analysis by the first string analysis module 12, the string in the string list with the largest number of characters outside the HTML tags, and to analyze the preceding string and the following string of that string in the string list. If the preceding string and/or the following string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags and greater than the set base value, the content of that preceding and/or following string is taken as subject information. In a specific implementation, the second string analysis module 16 performs steps 206 to 207 of method embodiment two above.
The device in embodiment two further comprises:
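The second analysis pass can be sketched in Python as follows. The function names are hypothetical, ties for the maximum are resolved in favour of the first string, and the counting rules (brackets counted as tag characters, whitespace ignored) are the same assumptions as in a simple reading of the first pass.

```python
import re

TAG = re.compile(r"<[^>]*>")


def outside_inside(s):
    """Return (outside, inside) character counts for one string."""
    inside = sum(len(t) for t in TAG.findall(s))
    outside = len("".join(TAG.sub("", s).split()))
    return outside, inside


def qualifies(s, base):
    outside, inside = outside_inside(s)
    return outside > inside and outside > base


def neighbours_of_max(string_list, base):
    """Return (index of the string with most outside text, qualifying neighbour indices)."""
    if not string_list:
        return None, []
    max_idx = max(
        range(len(string_list)),
        key=lambda i: outside_inside(string_list[i])[0],
    )
    extra = [
        i
        for i in (max_idx - 1, max_idx + 1)
        if 0 <= i < len(string_list) and qualifies(string_list[i], base)
    ]
    return max_idx, extra


# Hypothetical string list: navigation, lead paragraph, main text, footer.
strings = [
    '<div><a href="/">Home</a></div>',
    "<div><p>" + "a" * 30 + "</p></div>",
    "<div><p>" + "b" * 100 + "</p></div>",
    '<div><a href="/about">About</a></div>',
]
max_idx, extra = neighbours_of_max(strings, base=20)
```

In this example the main text at index 2 has the most outside characters; the lead paragraph before it also passes both conditions, while the footer after it does not.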
The string processing module 17 is configured to, when the preceding string and/or the following string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags and greater than the set base value, take that preceding and/or following string together with the string with the largest number of characters outside the HTML tags as the result string, and to process the result string to collect subject information. In a specific embodiment, the string processing module 17 performs steps 208 to 209 of method embodiment two above.
The device in embodiment two further comprises:
The information acquisition module 18 is configured to save the result string processed by the string processing module 17, together with the subject information it contains, for secondary development by users.
Implementing the present invention provides a device for acquiring internet subject information that is not tied to a uniform web page template and offers a general method. The HTML source code is first divided into different strings by div tags and each string is analyzed, so that all web pages on the internet can be analyzed and processed accurately. The analyzed strings then undergo a second analysis, which further improves the accuracy of web page analysis, so that subject information is collected quickly and accurately.
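Forming the result string can be pictured with the small Python sketch below. The function name build_result_string is hypothetical, and joining the selected strings in their original document order is an assumption; the source only says that the qualifying neighbours and the largest string together form the result string.

```python
def build_result_string(string_list, max_idx, neighbour_indices):
    """Join the largest string with its qualifying neighbours.

    The strings are concatenated in document order (assumption).
    """
    indices = sorted(set(neighbour_indices) | {max_idx})
    return "".join(string_list[i] for i in indices)


# Hypothetical string list with short stand-in strings.
parts = ["<div>nav</div>", "<div>lead</div>", "<div>body</div>", "<div>foot</div>"]
result = build_result_string(parts, 2, [1])
```

Here result is the lead string followed by the body string, which is then passed on for the cleaning steps.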
Fig. 6 is a structural diagram of embodiment three of the device for acquiring internet subject information of the present invention.
This embodiment describes in detail the string processing module 17 of device embodiment two above.
The string processing module 17 specifically comprises a character filtering unit 170, a picture download unit 171, a path replacement unit 172 and a tag processing unit 173. Their functions are as follows:
The character filtering unit 170 is configured to compare the characters outside each HTML tag in the result string with filter keywords and to filter out characters unrelated to subject information. Specifically, the filter keywords are predetermined; they represent noise information unrelated to subject information, such as illegal keywords, advertisement keywords, navigation bar keywords and survey keywords. In a specific implementation, the character filtering unit 170 performs step 300 of method embodiment three above.
The picture download unit 171 is configured to extract all image tags from the result string filtered by the character filtering unit 170, and to download and save the picture resources. It can also obtain the width and height of each picture.
The path replacement unit 172 is configured to replace the network resource paths in the result string with local resource paths.
The tag processing unit 173 is configured to retain the paragraph (p) tags and image tags in the result string and to delete the other tags in the result string.
Implementing the present invention provides a device for acquiring internet subject information which, building on the fast and accurate collection of subject information in device embodiments one and two, further cleans the collected subject information while preserving the original format of the news or log. It can also preserve the pictures of the original web page, so the result is better suited to secondary development.
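The processing of the result string can be sketched in Python as below. All function names are hypothetical. The sketch makes several assumptions the source leaves open: a text segment is dropped whole when it contains a filter keyword, the image tag is taken to be the HTML img element with a double-quoted src attribute, and the actual picture download is left out so that only the mapping from network paths to local paths is shown.

```python
import posixpath
import re
from urllib.parse import urlparse

TAG = re.compile(r"<[^>]*>")
IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc\s*=\s*")([^"]+)(")', re.IGNORECASE)
KEEP_TAG = re.compile(r"</?(p|img)\b[^>]*>", re.IGNORECASE)


def filter_keywords(result_string, keywords):
    """Drop text segments outside tags that contain a filter keyword."""
    kept = []
    for part in re.split(r"(<[^>]*>)", result_string):
        is_tag = part.startswith("<")
        if not is_tag and any(k in part for k in keywords):
            continue  # noise text such as an advertisement line
        kept.append(part)
    return "".join(kept)


def localize_images(result_string, local_dir):
    """Rewrite img src values to local paths; return (string, {remote: local})."""
    mapping = {}

    def repl(match):
        remote = match.group(2)
        name = posixpath.basename(urlparse(remote).path) or "image"
        local = posixpath.join(local_dir, name)
        mapping[remote] = local  # the caller would download remote to local
        return match.group(1) + local + match.group(3)

    return IMG_SRC.sub(repl, result_string), mapping


def keep_p_and_img(result_string):
    """Delete every tag except p and img; the text between tags is kept."""
    return TAG.sub(
        lambda m: m.group(0) if KEEP_TAG.fullmatch(m.group(0)) else "",
        result_string,
    )


# Hypothetical result string.
raw = (
    "<div><h2>Title</h2><p>Body text</p>"
    "<span>Advertisement: buy now</span>"
    '<img src="http://example.com/pics/a.jpg"><p>More</p></div>'
)
step1 = filter_keywords(raw, ["Advertisement"])
step2, image_map = localize_images(step1, "images")
final = keep_p_and_img(step2)
```

The order of the calls mirrors the order of the units: keyword filtering first, then image handling and path replacement, then tag clean-up. The image_map dictionary records which remote picture corresponds to which local file, which is what a download step would need.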
A person of ordinary skill in the art will understand that all or part of the flows of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the flows of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above are preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art can make improvements and modifications without departing from the principle of the present invention, and these improvements and modifications are also regarded as falling within the protection scope of the present invention.

Claims (18)

1. A method for acquiring internet subject information, characterized by comprising:
obtaining the HyperText Markup Language (HTML) source code of an internet web page;
dividing the HTML source code into different strings using the div tag as the marker tag, and forming the different strings into a string list (please write the specific format of these strings in the specification; otherwise the examiner is likely to issue an office action on the grounds of insufficient disclosure);
analyzing each string in the string list one by one, and when the number of characters outside the HTML tags in a string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base value, taking the content of that string as subject information.
2. The method of claim 1, characterized in that the subject information is news subject information or log subject information.
3. The method of claim 2, characterized in that the base value is set to different values depending on whether the subject information to be collected is news subject information or log subject information.
4. The method of claim 3, characterized in that, when the subject information is news subject information, the method comprises, before the step of obtaining the HTML source code of the internet web page:
downloading an Extensible Markup Language (XML) page and extracting list information;
downloading the Uniform Resource Locators (URLs) in the list information, in order to obtain the HTML source code of the web page containing the subject information.
5. The method of claim 4, characterized in that, after obtaining the HTML source code, the method comprises:
filtering out the HTML tags in the HTML source code that are unrelated to subject information.
6. The method of any one of claims 1 to 5, characterized in that, after taking the content of that string as subject information, the method further comprises:
obtaining the string in the string list with the largest number of characters outside the HTML tags;
analyzing the preceding string and the following string of that string in the string list; and if the preceding string and/or the following string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags and greater than the set base value, taking the content of that preceding and/or following string as subject information.
7. The method of claim 6, characterized in that, after taking the content of the preceding and/or following string as subject information, the method comprises:
when the preceding string and/or the following string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags and greater than the set base value, taking that preceding and/or following string together with the string with the largest number of characters outside the HTML tags as the result string;
processing the result string to collect subject information.
8. The method of claim 7, characterized in that processing the result string specifically comprises:
comparing the characters outside each HTML tag in the result string with filter keywords, and filtering out characters unrelated to the subject information to be collected;
extracting all image tags from the filtered result string, and downloading and saving the picture resources;
replacing the network resource paths in the result string with local resource paths;
retaining the paragraph (p) tags and image tags in the result string, and deleting the other tags in the result string.
9. The method of claim 8, characterized in that the processed result string and the subject information it contains are saved for secondary development.
10. A device for acquiring internet subject information, characterized by comprising:
a source code acquisition module, configured to obtain the HyperText Markup Language (HTML) source code of an internet web page;
a string forming module, configured to divide the HTML source code into different strings using the div tag as the marker tag, and to form the different strings into a string list;
a first string analysis module, configured to analyze each string in the string list one by one, and when the number of characters outside the HTML tags in a string is greater than the number of characters inside the HTML tags, and the number of characters outside the HTML tags is greater than a set base value, to take the content of that string as subject information.
11. The device of claim 10, characterized in that the subject information is news subject information or log subject information.
12. The device of claim 11, characterized in that the device further comprises:
a base value setting module, configured to set the base value to different values depending on whether the subject information to be collected is news subject information or log subject information.
13. The device of claim 12, characterized in that the device further comprises:
an information download module, configured to download an Extensible Markup Language (XML) page and extract list information, and to download the Uniform Resource Locators (URLs) in the list information and send them to the source code acquisition module for processing.
14. The device of claim 13, characterized in that the device further comprises:
an information filtering module, configured to filter out, from the HTML source code obtained by the source code acquisition module, the HTML tags unrelated to subject information.
15. The device of any one of claims 10 to 14, characterized in that the device further comprises:
a second string analysis module, configured to obtain, after the analysis by the first string analysis module, the string in the string list with the largest number of characters outside the HTML tags, and to analyze the preceding string and the following string of that string in the string list; and if the preceding string and/or the following string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags and greater than the set base value, to take the content of that preceding and/or following string as subject information.
16. The device of claim 15, characterized in that the device further comprises:
a string processing module, configured to, when the preceding string and/or the following string satisfies the condition that its number of characters outside the HTML tags is greater than its number of characters inside the HTML tags and greater than the set base value, take that preceding and/or following string together with the string with the largest number of characters outside the HTML tags as the result string, and to process the result string to collect subject information.
17. The device of claim 16, characterized in that the string processing module specifically comprises:
a character filtering unit, configured to compare the characters outside each HTML tag in the result string with filter keywords, and to filter out characters unrelated to subject information;
a picture download unit, configured to extract all image tags from the result string filtered by the character filtering unit, and to download and save the picture resources;
a path replacement unit, configured to replace the network resource paths in the result string with local resource paths;
a tag processing unit, configured to retain the paragraph (p) tags and image tags in the result string, and to delete the other tags in the result string.
18. The device of claim 17, characterized in that the device further comprises:
an information acquisition module, configured to save the result string processed by the string processing module, together with the subject information it contains, for secondary development by users.
CN 200910110356 2009-10-28 2009-10-28 Method for acquiring internet subject information and device thereof Expired - Fee Related CN101702160B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910110356 CN101702160B (en) 2009-10-28 2009-10-28 Method for acquiring internet subject information and device thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200910110356 CN101702160B (en) 2009-10-28 2009-10-28 Method for acquiring internet subject information and device thereof

Publications (2)

Publication Number Publication Date
CN101702160A true CN101702160A (en) 2010-05-05
CN101702160B CN101702160B (en) 2013-04-17

Family

ID=42157075

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910110356 Expired - Fee Related CN101702160B (en) 2009-10-28 2009-10-28 Method for acquiring internet subject information and device thereof

Country Status (1)

Country Link
CN (1) CN101702160B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
CN102737116A (en) * 2012-05-29 2012-10-17 深圳市同洲电子股份有限公司 Method and device for storing webpage resources
CN102750392A (en) * 2012-07-09 2012-10-24 浙江省公众信息产业有限公司 Web topic information extraction method and system
CN103279567A (en) * 2013-06-18 2013-09-04 重庆邮电大学 Web data collection method and system both based on AJAX (asynchronous javascript and extensible markup language)
WO2013178193A2 (en) * 2012-11-20 2013-12-05 中兴通讯股份有限公司 Text content extraction method and device
CN103488621A (en) * 2013-09-24 2014-01-01 长沙裕邦软件开发有限公司 Type setting method and system for laws and regulations
CN104156458A (en) * 2014-08-20 2014-11-19 百度在线网络技术(北京)有限公司 Information extraction method and device
CN104750812A (en) * 2015-03-30 2015-07-01 浪潮集团有限公司 Automatic data collecting method based on webpage label analysis
CN105578294A (en) * 2014-10-15 2016-05-11 优视科技有限公司 Browsing switching processing method, apparatus, and system
CN111488511A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN113505271A (en) * 2021-07-14 2021-10-15 杭州隆埠科技有限公司 HTML document analysis method, HTML document transmission method, HTML document analysis device, and HTML document transmission device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6961897B1 (en) * 1999-06-14 2005-11-01 Lockheed Martin Corporation System and method for interactive electronic media extraction for web page generation
CN101079031A (en) * 2006-06-15 2007-11-28 腾讯科技(深圳)有限公司 Web page subject extraction system and method
CN101470728B (en) * 2007-12-25 2011-06-08 北京大学 Method and device for automatically abstracting text of Chinese news web page

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102270206A (en) * 2010-06-03 2011-12-07 北京迅捷英翔网络科技有限公司 Method and device for capturing valid web page contents
CN102737116A (en) * 2012-05-29 2012-10-17 深圳市同洲电子股份有限公司 Method and device for storing webpage resources
CN102750392A (en) * 2012-07-09 2012-10-24 浙江省公众信息产业有限公司 Web topic information extraction method and system
CN102750392B (en) * 2012-07-09 2014-07-16 浙江省公众信息产业有限公司 Web topic information extraction method and system
WO2013178193A2 (en) * 2012-11-20 2013-12-05 中兴通讯股份有限公司 Text content extraction method and device
WO2013178193A3 (en) * 2012-11-20 2014-01-23 中兴通讯股份有限公司 Text content extraction method and device
CN103279567A (en) * 2013-06-18 2013-09-04 重庆邮电大学 Web data collection method and system both based on AJAX (asynchronous javascript and extensible markup language)
CN103488621A (en) * 2013-09-24 2014-01-01 长沙裕邦软件开发有限公司 Type setting method and system for laws and regulations
CN104156458A (en) * 2014-08-20 2014-11-19 百度在线网络技术(北京)有限公司 Information extraction method and device
CN104156458B (en) * 2014-08-20 2017-09-22 北京小度互娱科技有限公司 The extracting method and device of a kind of information
CN105578294A (en) * 2014-10-15 2016-05-11 优视科技有限公司 Browsing switching processing method, apparatus, and system
CN105578294B (en) * 2014-10-15 2018-12-21 优视科技有限公司 Browse switching handling method, apparatus and system
CN104750812A (en) * 2015-03-30 2015-07-01 浪潮集团有限公司 Automatic data collecting method based on webpage label analysis
CN111488511A (en) * 2019-01-25 2020-08-04 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN111488511B (en) * 2019-01-25 2024-04-09 深信服科技股份有限公司 Website theme extraction method and system, electronic equipment and storage medium
CN113505271A (en) * 2021-07-14 2021-10-15 杭州隆埠科技有限公司 HTML document analysis method, HTML document transmission method, HTML document analysis device, and HTML document transmission device

Also Published As

Publication number Publication date
CN101702160B (en) 2013-04-17


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: SHENZHEN LONGSHI MEDIA CO., LTD.

Free format text: FORMER OWNER: SHENZHEN TONGZHOU ELECTRONIC CO., LTD.

Effective date: 20120424

C41 Transfer of patent application or patent right or utility model
COR Change of bibliographic data

Free format text: CORRECT: ADDRESS; FROM: 518129 SHENZHEN, GUANGDONG PROVINCE TO: 518057 SHENZHEN, GUANGDONG PROVINCE

TA01 Transfer of patent application right

Effective date of registration: 20120424

Address after: 518057 District, Guangdong, Nanshan District hi tech Zone, the North Zone of the Fifth Industrial Zone, rainbow science and technology building, A2-3 District,

Applicant after: Shenzhen Longguan Media Co., Ltd.

Address before: 518129 Rainbow Technology Building, North hi tech Zone, Nanshan District, Guangdong, Shenzhen

Applicant before: Shenzhen Tongzhou Electronic Co., Ltd.

C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130417

Termination date: 20161028