CN104572874B

CN104572874B - Method and device for extracting webpage information

Info

Publication number: CN104572874B
Application number: CN201410804430.9A
Authority: CN
Inventors: 刘雄伟
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2014-12-19
Filing date: 2014-12-19
Publication date: 2019-03-05
Anticipated expiration: 2034-12-19
Also published as: CN104572874A

Abstract

An embodiment of the invention discloses a method and device for extracting webpage information. The method includes: obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted; selecting a preset template according to that URL; and extracting the webpage information using the selected preset template. This improves the accuracy of webpage information extraction.

Description

Method and device for extracting webpage information

Technical field

The present invention relates to the field of information technology, and in particular to a method and device for extracting webpage information.

Background technique

With the rapid development of the internet, online media, as a new mode of information dissemination, has become deeply embedded in people's daily lives. Text information extraction is an accurate and efficient way of acquiring information. It extracts the information users need, such as specified entities, relationships and events, from one or more webpages, forms structured data, and presents the data to the user. This approach offers accurate content, low redundancy and a well-organized structure.

In the prior art, many techniques can be used to extract multi-record webpages. For example, the traditional approach extracts with hand-written rules, which can extract record information from a specific data source quickly and accurately. However, as online information keeps growing and web content is continuously updated, extracting webpage information with a single manually configured template inevitably lowers extraction accuracy in the face of massive, ever-changing data. Even when extraction is limited to site pages in a single domain, the pages are numerous and their layout styles are varied and changeable, so existing techniques still cannot effectively improve extraction accuracy.

Summary of the invention

In view of this, embodiments of the present invention propose a method and device for extracting webpage information, so as to improve the accuracy of webpage information extraction.

In a first aspect, an embodiment of the invention provides a method for extracting webpage information, the method comprising:

Obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted;

Selecting a preset template according to the URL of the webpage from which information is to be extracted;

Extracting the webpage information using the selected preset template.

In a second aspect, an embodiment of the invention provides a device for extracting webpage information, the device comprising:

A URL acquiring unit, for obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted;

A template selection unit, for selecting a preset template according to the URL of the webpage from which information is to be extracted;

A webpage information extraction unit, for extracting the webpage information using the selected preset template.

The method and device for extracting webpage information provided by the embodiments of the present invention obtain the uniform resource locator (URL) of the webpage from which information is to be extracted, select a preset template according to that URL, and extract the webpage information using the selected preset template, thereby improving the accuracy of webpage information extraction.

Detailed description of the invention

Other features, objects and advantages of the invention will become more apparent from the following detailed description of non-restrictive embodiments, read with reference to the accompanying drawings:

Fig. 1 is a flow chart of the webpage information extraction method provided by the first embodiment of the invention;

Fig. 2 is a schematic diagram of the webpage information extraction method provided by the first embodiment of the invention;

Fig. 3 is a flow chart of the webpage information extraction method provided by the second embodiment of the invention;

Fig. 4 is a schematic diagram of the webpage information extraction method provided by the second embodiment of the invention;

Fig. 5 is a flow chart of the webpage information extraction method provided by the third embodiment of the invention;

Fig. 6 is a flow chart of the webpage information extraction method provided by the fourth embodiment of the invention;

Fig. 7 is a flow chart of the webpage information extraction method provided by the fifth embodiment of the invention;

Fig. 8 is a structure chart of the webpage information extraction device provided by the sixth embodiment of the invention.

Specific embodiment

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only some, not all, of the content related to the invention.

Fig. 1 and Fig. 2 shows the first embodiment of the present invention.

Fig. 1 is a flow chart of the webpage information extraction method provided by the first embodiment of the invention, and Fig. 2 is a schematic diagram of that method. The webpage information extraction method includes:

Step S101: obtain the uniform resource locator (URL) of the webpage from which information is to be extracted.

A uniform resource locator (Uniform Resource Locator, URL) is a concise representation of the location of a resource available on the internet and of the method for accessing it; it is the address of a standard resource on the internet. Every file on the internet has a unique URL, which indicates where the file is located and how a browser should handle it.

A URL also serves as an address on the World Wide Web: every webpage that can be accessed on the internet has a URL. Therefore, for a webpage from which information is to be extracted, the URL of the webpage should be obtained first. For example, to extract information from the NetEase homepage, the URL of the NetEase homepage (i.e. http://www.163.com/) must be obtained first.

Step S102: select a preset template according to the URL of the webpage from which information is to be extracted.

Different templates can be preset for different websites, because the information shown by different websites differs greatly. Take Sina and Taobao: Sina, as a comprehensive portal, mainly shows news, whereas Taobao mainly shows merchandise. The extraction templates used for these two websites necessarily differ considerably. If the same extraction template were used for both, accuracy would inevitably drop, because the regular expressions in an extraction template only work on the character strings they were written for. Therefore, the corresponding preset template can be selected according to the URL of the webpage to be extracted, which improves the accuracy of webpage information extraction.
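As an illustrative sketch only, the URL-based template selection of step S102 could look as follows in Python. The registry `TEMPLATES`, the template names and the function `select_template` are assumptions introduced here for illustration; the specification does not prescribe how templates are keyed.

```python
from urllib.parse import urlparse

# Hypothetical template registry: host suffix -> template name.
TEMPLATES = {
    "sina.com.cn": "news_template",
    "taobao.com": "product_template",
}


def select_template(url, templates=TEMPLATES, default=None):
    """Return the template registered for the URL's host, or `default`."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, template in templates.items():
        # Match the registered host itself or any of its subdomains.
        if host == suffix or host.endswith("." + suffix):
            return template
    return default
```

Matching on the host suffix lets one template cover all subdomains of a site; a URL from an unregistered site falls back to `default`.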

Step S103: extract the webpage information using the selected preset template.

The webpage information is extracted according to the preset template selected in step S102; the template may be a group of regular expressions. A regular expression is a logical formula for operating on character strings: predefined specific characters, and combinations of them, form a "rule string" that expresses a filtering logic over character strings.

Given a regular expression and another character string, the following can be achieved: determining whether the character string satisfies the filtering logic of the regular expression (called "matching"), and obtaining the desired specific part of the character string by means of the regular expression.

With the configured regular expressions, the relevant content can be identified in and extracted from the webpage content and the irrelevant content removed, and the extracted information is stored in a specified database for convenient querying and viewing.
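The use of a template made of regular expressions can be sketched as follows. This is a minimal illustration under assumptions: the field names, the patterns in `NEWS_TEMPLATE` and the sample page are hypothetical, and storage in a database is omitted.

```python
import re

# Hypothetical template: field name -> regular expression with one capture group.
NEWS_TEMPLATE = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "date": re.compile(r"(\d{4}-\d{2}-\d{2})"),
}


def extract(html, template):
    """Apply every pattern in the template; fields that do not match map to None."""
    record = {}
    for field, pattern in template.items():
        match = pattern.search(html)
        record[field] = match.group(1).strip() if match else None
    return record


page = "<html><h1> Quarterly results </h1><span>2014-12-19</span></html>"
record = extract(page, NEWS_TEMPLATE)
```

Each pattern both tests whether the page matches and captures the wanted part, which corresponds to the two uses of regular expressions described above.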

This embodiment of the invention obtains the uniform resource locator (URL) of the webpage from which information is to be extracted, selects a preset template according to that URL, and extracts the webpage information using the selected preset template, thereby improving extraction accuracy.

Embodiment two

Fig. 3 and Fig. 4 show the second embodiment of the present invention.

Fig. 3 is a flow chart of the webpage information extraction method provided by the second embodiment of the invention, and Fig. 4 is a schematic diagram of that method. The method is based on the first embodiment. Further, obtaining the uniform resource locator (URL) of the webpage to be extracted is specifically optimized as: obtaining the URL of the webpage to be extracted and the URLs included in that webpage. Selecting a preset template according to the URL of the webpage to be extracted is specifically optimized as: selecting a preset template according to the URL of the webpage to be extracted and the URLs included in that webpage.

Referring to Fig. 3 and Fig. 4, the webpage information extraction method includes:

Step S201: obtain the URL of the webpage to be extracted and the URLs included in that webpage.

The webpage to be extracted may contain multiple links. For example, the webpage to be extracted may be the homepage of a portal website. The NetEase homepage, for instance, includes links to several sub-sections, such as forum, news and finance. The links, and the web content they point to, can be obtained with a web crawler. A web crawler is a program that fetches webpages automatically: starting from the URLs of one or several initial pages, it obtains the URLs on those pages and, while crawling, continually extracts new URLs from the current page and puts them into a queue.
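The queue-based crawling described above can be sketched as follows. This is only an illustration under assumptions: the `fetch` callable, the in-memory `SITE` dictionary standing in for the network, and the `limit` parameter are hypothetical and not part of the specification.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(seed, fetch, limit=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    queue, seen, pages = deque([seed]), {seed}, {}
    while queue and len(pages) < limit:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)     # newly found URL joins the queue
    return pages


# In-memory stand-in for the network (assumption, for illustration only).
SITE = {
    "http://example.com/": '<a href="/news">News</a><a href="/finance">Finance</a>',
    "http://example.com/news": '<a href="/">Home</a>',
    "http://example.com/finance": "<p>no links</p>",
}
pages = crawl("http://example.com/", SITE.get)
```

The returned mapping pairs each visited URL with its content, so a template can then be selected per URL.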

Step S202: select a preset template according to the URL of the webpage to be extracted and the URLs included in that webpage.

The webpage to be extracted may contain multiple links. For example, the homepage of a portal website includes links to several sub-sections, such as forum, news and finance. Because the content of the sub-sections differs greatly, the corresponding preset template must be chosen according to the URL of each sub-section; a template may consist of a group of regular expressions.

Step S203: extract the webpage information using the selected preset template.

In this embodiment of the invention, obtaining the uniform resource locator (URL) of the webpage to be extracted is specifically optimized as: obtaining the URL of the webpage to be extracted and the URLs included in that webpage. Selecting a preset template according to the URL of the webpage to be extracted is specifically optimized as: selecting a preset template according to the URL of the webpage to be extracted and the URLs included in that webpage. A web crawler can be used to obtain the URLs included in the webpage and the content they point to, and a suitable template is selected according to each included URL to extract the webpage information. In this way, the extraction of information from multiple webpages can be completed quickly and automatically while accuracy is maintained.

Embodiment three

Fig. 5 shows the third embodiment of the present invention.

Fig. 5 is a flow chart of the webpage information extraction method provided by the third embodiment of the invention. The method is based on the first embodiment. Further, after the uniform resource locator (URL) of the webpage to be extracted is obtained, the following step is added: divide the page into blocks. Selecting a preset template according to the URL of the webpage to be extracted is specifically optimized as: selecting a preset template according to the URL of the webpage to be extracted and the blocking information. Extracting the webpage information using the selected preset template specifically includes: extracting the webpage information using the preset template selected according to the URL of the webpage to be extracted and the blocking information.

Referring to Fig. 5, the webpage information extraction method includes:

Step S301: obtain the uniform resource locator (URL) of the webpage to be extracted.

Step S302: divide the page into blocks.

The page to be extracted is laid out by formatting its text, figures and tables, so that the page contains multiple blocks, such as information blocks, image blocks and advertisement blocks. The webpage can be divided into blocks according to the specific content of each block; a webpage with simple content can also be divided by setting region ranges.

Step S303: select a preset template according to the URL of the webpage to be extracted and the blocking information.

For a page that has been divided into blocks, a suitable preset template can be selected from the template database according to the URL of the webpage and the position of the block within the page.
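A lookup keyed on both the page URL and the block position could be sketched as follows. The contents of `TEMPLATE_DB`, the position labels and the function name are assumptions made for illustration; the specification does not describe the layout of the template database.

```python
from urllib.parse import urlparse

# Hypothetical template database: (host, block position) -> template name.
TEMPLATE_DB = {
    ("finance.example.com", "main"): "financial_report_template",
    ("finance.example.com", "sidebar"): "headline_list_template",
}


def select_block_template(url, block_position, db=TEMPLATE_DB):
    """Return the template for this host and block position, or None."""
    host = (urlparse(url).hostname or "").lower()
    return db.get((host, block_position))
```

Blocks with no registered template return `None` and can simply be skipped by the extraction step.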

Step S304: extract the webpage information using the preset template selected according to the URL of the webpage to be extracted and the blocking information.

The information in each block of the webpage is extracted with the template selected in step S303.

In this embodiment of the invention, after the uniform resource locator (URL) of the webpage to be extracted is obtained, the following step is added: divide the page into blocks. Selecting a preset template according to the URL of the webpage to be extracted is specifically optimized as: selecting a preset template according to the URL of the webpage to be extracted and the blocking information. Extracting the webpage information using the selected preset template specifically includes: extracting the webpage information using the preset template selected according to the URL of the webpage to be extracted and the blocking information. The webpage to be extracted is divided into blocks, and a suitable template is chosen according to the blocking information and the webpage URL to extract the webpage information, which speeds up extraction and further improves its accuracy.

Embodiment four

Fig. 6 shows the fourth embodiment of the present invention.

Fig. 6 is a flow chart of the webpage information extraction method provided by the fourth embodiment of the invention. The method is based on the third embodiment. Further, dividing the page into blocks is specifically optimized as: traversing all tags of the page and determining the block regions formed by consecutive tags.

Referring to Fig. 6, the webpage information extraction method includes:

Step S401: obtain the uniform resource locator (URL) of the webpage to be extracted.

Step S402: traverse all separator tags of the page.

In the page to be extracted, different content can be marked with corresponding tags, for example in the page's HyperText Markup Language (HTML) text file, in which information blocks are described using tags such as <beginTag></beginTag>, <endTag></endTag> and <divideTag></divideTag>. Here <beginTag></beginTag> and <endTag></endTag> indicate the start and end positions of an information block, so that the information block can be located in the HTML page source file, while <divideTag></divideTag> indicates a marker that acts as a divider inside the information block. All tags of the page can be traversed according to the HTML text file of the page to be extracted.

Step S403: determine the block regions formed by consecutive tags.

From the result of traversing all tags of the page in step S402, consecutive tags can be found. For example, with <beginTag></beginTag> and <endTag></endTag>, the content enclosed by these tags is the information of that block. The interior of an information block consists of multiple parts with different content but identical format. <divideTag></divideTag> indicates a marker that acts as a divider inside the information block, that is, it distinguishes the individual information sub-blocks within the larger information block.
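The role of the begin, end and divide markers can be sketched as follows. This is an illustration under assumptions: the markers are treated as literal strings, and the function `split_block` and the sample page are hypothetical.

```python
def split_block(html, begin, end, divide):
    """Return the sub-blocks found between `begin` and `end`, split on `divide`."""
    start = html.find(begin)
    if start == -1:
        return []
    start += len(begin)
    stop = html.find(end, start)
    if stop == -1:
        return []
    body = html[start:stop]
    # Drop empty fragments left over after the last divider.
    return [part.strip() for part in body.split(divide) if part.strip()]


page = '<ul id="reports"><li>Q1 report</li><li>Q2 report</li></ul><p>footer</p>'
items = split_block(page, '<ul id="reports">', "</ul>", "</li>")
```

Here the begin and end markers delimit the information block, and the divide marker separates the identically formatted sub-blocks inside it.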

Step S404: select a preset template according to the URL of the webpage to be extracted and the blocking information.

Step S405: extract the webpage information using the preset template selected according to the URL of the webpage to be extracted and the blocking information.

In this embodiment of the invention, dividing the page into blocks is specifically optimized as: traversing all tags of the page and determining the block regions formed by consecutive tags. The page can thus be divided into blocks accurately according to the content of the webpage, which further improves extraction accuracy.

Embodiment five

Fig. 7 is a flow chart of the webpage information extraction method provided by the fifth embodiment of the invention. The method is based on the fourth embodiment. Further, determining the block regions formed by consecutive tags is specifically optimized as: computing, from the configured separator-tag weights, the weight of each block region formed between separator tags; and determining the block regions formed between separator tags whose weight is greater than a preset value.

Referring to Fig. 7, the webpage information extraction method includes:

Step S501: obtain the uniform resource locator (URL) of the webpage to be extracted.

Step S502: traverse all tags of the page.

Step S503: compute, from the configured separator-tag weights, the weight of each block formed between separator tags.

The blocks delimited by separator tags differ greatly: some blocks contain a lot of information content, while others contain only a few words. Link blocks in particular clearly do not need to be extracted. If, following the original method, these link blocks also had to be extracted with a template, considerable resources would be wasted. The blocks formed between separator tags therefore need to be examined to judge whether they should be extracted with a template.

In this embodiment, the separated blocks are judged against a preset threshold for the interval blocks formed between separator tags. The following procedure can be used:

n := 0; k := 0; TagSeg := Φ;

While Not Doc end of file

k := k + 1

: the k-th HTML tag extracted from Doc

If Blank() // consecutive HTML tags are present

If ∈ S // consecutive separator tags are present

→ TagSeg

End If

End

If // separator-tag segment

Compute the segment weight corresponding to the separator-tag segment,

End Else

End While

Step S504: determine the block regions formed between separator tags whose weight is greater than the preset value.

According to the result computed in step S503, the block regions that satisfy the configured threshold can be put into one set; the block regions in this set are the block regions formed between separator tags whose weight is greater than the preset value. The code is as follows:

If S_ws >= S' // the separator-tag segment forms an interval

<B_n, TagSeg_n> → Q

End If

// empty the separator-tag set
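Because several symbols of the procedure above did not survive, the following Python sketch shows only one possible reading of steps S503 and S504. It is an assumption, not the patented procedure: the weights in `SEPARATOR_WEIGHTS`, the rule that the weight of a run of consecutive separator tags is the sum of the tags' weights, and the default threshold are all hypothetical.

```python
import re

# Assumed separator-tag weights; tags not listed here are not separators.
SEPARATOR_WEIGHTS = {"div": 1.0, "table": 1.0, "p": 0.5, "hr": 0.5, "br": 0.25}
TAG_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def segment_blocks(html, weights=SEPARATOR_WEIGHTS, threshold=1.0):
    """Cut the page at runs of consecutive separator tags whose summed weight
    reaches `threshold`; return the non-empty blocks between the cuts."""
    blocks = []
    block_start = 0  # offset where the current block begins
    run = None       # [start, end, weight] of the current separator-tag run
    for m in list(TAG_RE.finditer(html)) + [None]:
        name = m.group(1).lower() if m is not None else None
        contiguous = (
            run is not None
            and name in weights
            and html[run[1]:m.start()].strip() == ""
        )
        if contiguous:
            run[1] = m.end()
            run[2] += weights[name]
            continue
        # The current run (if any) ends here: cut only if it is heavy enough.
        if run is not None and run[2] >= threshold:
            text = html[block_start:run[0]].strip()
            if text:
                blocks.append(text)
            block_start = run[1]
        run = [m.start(), m.end(), weights[name]] if name in weights else None
    tail = html[block_start:].strip()
    if tail:
        blocks.append(tail)
    return blocks


page = ("<div><p>Annual report 2014</p></div>"
        "<div><p>Revenue rose</p><br>slightly</div>")
blocks = segment_blocks(page)
```

In this sketch a light run such as `</p><br>` does not reach the threshold, so the text around it stays in one block, whereas a heavy run such as `</p></div><div><p>` starts a new block.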

Step S505: select a preset template according to the URL of the webpage to be extracted and the blocking information.

Step S506: extract the webpage information using the preset template selected according to the URL of the webpage to be extracted and the blocking information.

In this embodiment of the invention, determining the block regions formed by consecutive tags is specifically optimized as: computing, from the configured separator-tag weights, the weight of each block region formed between separator tags; and determining the block regions formed between separator tags whose weight is greater than the preset value. The block regions of the page can thus be assessed and those that need not be extracted removed, which reduces the work of selecting templates and of extracting block information with them, lowers the extraction workload, speeds up extraction, and also improves extraction accuracy.

The webpage information extraction method provided by this embodiment was used to extract the financial report information of listed companies from three major websites, Sina, Sohu and Tencent, with the following results:

Embodiment six

Fig. 8 shows sixth embodiment of the invention.

Fig. 8 is a structure chart of the webpage information extraction device provided by the sixth embodiment of the invention.

As can be seen from Fig. 8, the webpage information extraction device includes: a URL acquiring unit 610, a template selection unit 620 and a webpage information extraction unit 630.

The URL acquiring unit is used to obtain the uniform resource locator (URL) of the webpage to be extracted;

Further, the URL acquiring unit is specifically used for:

Obtaining the URL of the webpage to be extracted and the URLs included in that webpage;

The template selection unit is specifically used for:

Selecting a preset template according to the URL of the webpage to be extracted and the URLs included in that webpage.

Further, the webpage information extraction device also includes a blocking unit 640.

The blocking unit is used to divide the page into blocks;

The template selection unit is specifically used for:

Selecting a preset template according to the URL of the webpage to be extracted and the blocking information;

The webpage information extraction unit is specifically used for:

Extracting the webpage information using the preset template selected according to the URL of the webpage to be extracted and the blocking information.

Further, the blocking unit also includes: a traversal unit 641 and a block region determination unit 642.

The traversal unit is used to traverse all separator tags of the page;

The block region determination unit is used to determine the block regions formed by consecutive separator tags.

Further, the block region determination unit includes: a weight calculation unit 6421 and a second region determination unit 6422.

The weight calculation unit is used to compute, from the configured separator-tag weights, the weight of each region formed between separator tags;

The second region determination unit is used to determine the block regions formed between separator tags whose weight is greater than the preset value.

The webpage information extraction device described above can execute the webpage information extraction method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to the method executed.

The numbering of the above embodiments of the invention is for description only and does not indicate the relative merits of the embodiments.

Those skilled in the art will appreciate that each module or step of the invention described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The invention is therefore not limited to any specific combination of hardware and software.

The embodiments in this specification are described progressively: each embodiment emphasizes its differences from the others, and for the parts that are the same or similar, the embodiments may be referred to one another.

The above is only a preferred embodiment of the present invention and is not intended to limit it; for those skilled in the art, the invention may be subject to various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A method for extracting webpage information, comprising:

Extracting the webpage information using the selected preset template;

After obtaining the URL of the webpage to be extracted, the method further includes:

Dividing the page into blocks;

Said selecting a preset template according to the URL of the webpage to be extracted specifically includes: selecting a preset template according to the URL of the webpage to be extracted and the blocking information;

Said extracting the webpage information using the selected preset template specifically includes:

Extracting the webpage information using the preset template selected according to the URL of the webpage to be extracted and the blocking information;

Wherein said dividing the page into blocks specifically includes:

Traversing all separator tags of the page;

Determining the block regions formed by consecutive separator tags;

Wherein said determining the blocks formed by consecutive tags specifically includes:

Computing, from the configured separator-tag weights, the weight of each block formed between separator tags;

Determining the block regions formed between separator tags whose weight is greater than a preset value.

2. The method for extracting webpage information according to claim 1, characterized in that said obtaining the uniform resource locator (URL) of the webpage to be extracted specifically includes:

Said selecting a preset template according to the URL of the webpage to be extracted specifically includes:

Selecting a preset template according to the URL of the webpage to be extracted and the URLs included in that webpage.

3. A device for extracting webpage information, comprising:

A webpage information extraction unit, for extracting the webpage information using the selected preset template;

The device for extracting webpage information further includes:

A blocking unit, for dividing the page into blocks;

The template selection unit is specifically used for:

Selecting a preset template according to the URL of the webpage to be extracted and the blocking information;

The webpage information extraction unit is specifically used for:

Extracting the webpage information using the preset template selected according to the URL of the webpage to be extracted and the blocking information;

Wherein the blocking unit further includes:

A traversal unit, for traversing all separator tags of the page;

A block region determination unit, for determining the block regions formed by consecutive separator tags;

Wherein the block region determination unit includes:

A weight calculation unit, for computing, from the configured separator-tag weights, the weight of each region formed between separator tags;

A second region determination unit, for determining the block regions formed between separator tags whose weight is greater than a preset value.

4. The device for extracting webpage information according to claim 3, characterized in that the URL acquiring unit is specifically used for:

The template selection unit is specifically used for: