CN104572874A - Webpage information extraction method and device

Info

Publication number: CN104572874A
Application number: CN201410804430.9A
Authority: CN
Inventors: 刘雄伟
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2014-12-19
Filing date: 2014-12-19
Publication date: 2015-04-29
Anticipated expiration: 2034-12-19
Also published as: CN104572874B

Abstract

The embodiment of the invention discloses a webpage information extraction method and device. The webpage information extraction method comprises the following steps: acquiring the uniform resource locator (URL) of a webpage from which information is to be extracted; selecting a preset template according to the URL of the webpage from which the information is to be extracted; extracting the webpage information by using the selected preset template. Therefore, the webpage information extraction accuracy is increased.

Description

Webpage information extraction method and device

Technical field

The present invention relates to the field of information technology, and in particular to a method and device for extracting webpage information.

Background technology

With the rapid development of the Internet, online media, as a new form of information dissemination, has become part of daily life. Text information extraction is an accurate and efficient way of obtaining information. It extracts the information users need, such as specified entities, relations and events, from one or more webpages, organizes it into structured data, and presents it to the user. Its advantages include accurate content, low redundancy and a well-organized structure.

In the prior art, several techniques can be used to extract information from multi-record webpages. A traditional approach is to write extraction rules, which can extract record information quickly and accurately from a specific data source. However, as the volume of online information grows and webpage content is continually updated, relying on a single manually configured template to extract information from massive, ever-changing data inevitably lowers extraction accuracy. Even when extraction is limited to websites in a single domain, the pages are numerous and their layouts are varied and changeable, so existing techniques still cannot effectively improve extraction accuracy.

Summary of the invention

In view of this, embodiments of the present invention provide a webpage information extraction method and device, so as to improve the accuracy of webpage information extraction.

In a first aspect, an embodiment of the present invention provides a webpage information extraction method, the method comprising:

obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted;

selecting a preset template according to the URL of the webpage from which information is to be extracted;

extracting the webpage information using the selected preset template.

In a second aspect, an embodiment of the present invention provides a webpage information extraction device, the device comprising:

a URL acquiring unit, configured to obtain the uniform resource locator (URL) of the webpage from which information is to be extracted;

a template selection unit, configured to select a preset template according to the URL of the webpage from which information is to be extracted;

a webpage information extraction unit, configured to extract the webpage information using the selected preset template.

The webpage information extraction method and device provided by the embodiments of the present invention obtain the URL of the webpage from which information is to be extracted, select a preset template according to that URL, and extract the webpage information using the selected preset template, thereby improving the accuracy of webpage information extraction.

Brief description of the drawings

Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, given with reference to the accompanying drawings:

Fig. 1 is a flowchart of the webpage information extraction method provided by the first embodiment of the present invention;

Fig. 2 is a schematic diagram of the webpage information extraction method provided by the first embodiment of the present invention;

Fig. 3 is a flowchart of the webpage information extraction method provided by the second embodiment of the present invention;

Fig. 4 is a schematic diagram of the webpage information extraction method provided by the second embodiment of the present invention;

Fig. 5 is a flowchart of the webpage information extraction method provided by the third embodiment of the present invention;

Fig. 6 is a flowchart of the webpage information extraction method provided by the fourth embodiment of the present invention;

Fig. 7 is a flowchart of the webpage information extraction method provided by the fifth embodiment of the present invention;

Fig. 8 is a structural diagram of the webpage information extraction device provided by the sixth embodiment of the present invention.

Detailed description

The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described herein serve only to explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire content.

Fig. 1 and Fig. 2 show the first embodiment of the present invention.

Fig. 1 is a flowchart of the webpage information extraction method provided by the first embodiment of the present invention, and Fig. 2 is a schematic diagram of that method. The webpage information extraction method comprises:

Step S101, obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted.

A uniform resource locator (Uniform Resource Locator, URL) is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, which contains information indicating where the file is located and how a browser should handle it.

In addition, a URL also serves as an address on the World Wide Web. Every webpage that can be accessed on the Internet has a URL. Therefore, for a webpage from which information is to be extracted, the URL of that webpage should be obtained first. For example, to extract information from the NetEase homepage, the URL of the NetEase homepage (i.e. http://www.163.com/) must first be obtained.

Step S102, selecting a preset template according to the URL of the webpage from which information is to be extracted.

Different templates can be preset for different websites, because the information displayed by different websites differs greatly. Take Sina and Taobao as examples: Sina, as a comprehensive portal, mainly displays news, whereas Taobao mainly displays products. The extraction templates used for these two websites necessarily differ considerably. If the same extraction template were used for both, accuracy would inevitably drop, because the regular expressions in an extraction template only match strings of the form they were written for. Therefore, the corresponding preset template can be selected according to the obtained URL of the webpage from which information is to be extracted, which improves the accuracy of webpage information extraction.
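As an illustration of step S102, the following minimal Python sketch looks up a preset template by the host name of the URL. The registry `TEMPLATES_BY_HOST`, the template names and the function `select_template` are hypothetical and are not taken from the patent; the patent does not specify how the lookup is keyed.

```python
from urllib.parse import urlsplit

# Hypothetical template registry: host name -> name of a preset template.
TEMPLATES_BY_HOST = {
    "www.sina.com.cn": "news_portal_template",
    "www.taobao.com": "product_listing_template",
}


def select_template(url, default=None):
    """Return the preset template registered for the URL's host, if any."""
    host = urlsplit(url).hostname  # lower-cased host name, without the port
    return TEMPLATES_BY_HOST.get(host, default)
```

Keying on the host name is only one possible choice; a finer-grained key (for example host plus path) would be needed when one site hosts very different page types.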

Step S103, extracting the webpage information using the selected preset template.

The webpage information is extracted using the preset template selected in step S102. The template may be a set of regular expressions. A regular expression is a logical formula for operating on strings: a "rule string" is built from predefined special characters and combinations of them, and this rule string expresses a filtering logic for strings.

Given a regular expression and a string, one can determine whether the string satisfies the filtering logic of the regular expression (known as "matching"), and one can also use the regular expression to obtain a desired portion of the string.

With suitably configured regular expressions, the relevant content of a webpage can be identified and extracted, irrelevant content can be removed, and the extracted information can be stored in a specified database for convenient querying and viewing.
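The idea of a template as a set of regular expressions can be sketched as follows. This is a minimal illustration under assumptions: the field names, the patterns in `NEWS_TEMPLATE` and the function `extract` are hypothetical, and storing the result in a database is omitted.

```python
import re

# Hypothetical template: field name -> regular expression with one capture group.
NEWS_TEMPLATE = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "date": re.compile(r"(\d{4}-\d{2}-\d{2})"),
}


def extract(html, template):
    """Apply each regular expression of the template and keep its first match."""
    record = {}
    for name, pattern in template.items():
        match = pattern.search(html)
        record[name] = match.group(1).strip() if match else None
    return record


page = "<html><h1 class='t'> Markets rally </h1><span>2014-12-19</span></html>"
print(extract(page, NEWS_TEMPLATE))
```

A field whose pattern does not match is reported as `None`, which makes it easy to notice when a template no longer fits a page.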

In this embodiment of the present invention, the URL of the webpage from which information is to be extracted is obtained, a preset template is selected according to that URL, and the webpage information is extracted using the selected preset template, thereby improving the accuracy of information extraction.

Embodiment two

Fig. 3 and Fig. 4 show the second embodiment of the present invention.

Fig. 3 is a flowchart of the webpage information extraction method provided by the second embodiment of the present invention, and Fig. 4 is a schematic diagram of that method. The method is based on the first embodiment. Further, obtaining the URL of the webpage from which information is to be extracted is refined as: obtaining the URL of the webpage from which information is to be extracted and the URLs contained in that webpage; and selecting a preset template according to the URL of the webpage from which information is to be extracted is refined as: selecting a preset template according to the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.

Referring to Fig. 3 and Fig. 4, the webpage information extraction method comprises:

Step S201, obtaining the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.

The webpage from which information is to be extracted may contain multiple links. For example, the webpage may be the homepage of a portal website. The NetEase homepage, for instance, contains links to a number of sections, such as forum, news and finance. These links, and the webpage content they point to, can be obtained by a web crawler. A web crawler is a program that fetches webpages automatically: starting from the URLs of one or more initial pages, it obtains the URLs on those pages and, while crawling, continually extracts new URLs from the current page and places them in a queue.
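The queue-based crawling described above can be sketched in Python as follows. This is a minimal sketch, not the patent's crawler: the class `LinkParser`, the function `crawl`, the `fetch` callback and the in-memory `PAGES` site are all hypothetical, and the page loader is injected so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, limit=100):
    """Breadth-first crawl; `fetch` is a caller-supplied function url -> HTML text."""
    queue, seen, order = deque([start_url]), {start_url}, []
    while queue and len(order) < limit:
        url = queue.popleft()
        order.append(url)
        parser = LinkParser()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return order


# Toy in-memory "site" standing in for real HTTP requests.
PAGES = {
    "http://example.test/": '<a href="/news">News</a> <a href="/forum">Forum</a>',
    "http://example.test/news": '<a href="/">Home</a>',
    "http://example.test/forum": "",
}
print(crawl("http://example.test/", lambda u: PAGES.get(u, "")))
```

The `seen` set prevents a page from being queued twice, and `limit` bounds the number of pages visited.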

Step S202, selecting a preset template according to the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.

The webpage from which information is to be extracted may contain multiple links. For example, the homepage of a portal website contains links to a number of sections, such as forum, news and finance. Because the content of the sections differs greatly, the corresponding preset template needs to be selected according to the URL of each section. A template may consist of a set of regular expressions.
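One way to choose a template per section is to match the URL path against registered prefixes and take the longest match. The sketch below assumes this scheme; the registry `SECTION_TEMPLATES`, the host `portal.example.test` and the function `select_section_template` are hypothetical and not prescribed by the patent.

```python
from urllib.parse import urlsplit

# Hypothetical registry: (host, path prefix) -> template name.
SECTION_TEMPLATES = {
    ("portal.example.test", "/news/"): "news_template",
    ("portal.example.test", "/finance/"): "finance_template",
    ("portal.example.test", "/"): "homepage_template",
}


def select_section_template(url):
    """Pick the template whose path prefix is the longest match for the URL."""
    parts = urlsplit(url)
    path = parts.path or "/"
    candidates = [
        (prefix, name)
        for (host, prefix), name in SECTION_TEMPLATES.items()
        if host == parts.hostname and path.startswith(prefix)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda c: len(c[0]))[1]
```

Under this scheme a section without its own entry falls back to the template registered for the shorter prefix `/`.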

Step S203, extracting the webpage information using the selected preset template.

In this embodiment of the present invention, obtaining the URL of the webpage from which information is to be extracted is refined as obtaining the URL of that webpage and the URLs contained in it, and selecting a preset template according to the URL of the webpage is refined as selecting a preset template according to the URL of that webpage and the URLs contained in it. A web crawler can be used to obtain the URLs contained in the webpage and the webpage content they point to, and a suitable template is selected according to each contained URL to extract the webpage information. In this way, the extraction of information from multiple webpages can be completed quickly and automatically while accuracy is maintained.

Embodiment three

Fig. 5 shows the third embodiment of the present invention.

Fig. 5 is a flowchart of the webpage information extraction method provided by the third embodiment of the present invention. The method is based on the first embodiment. Further, after the URL of the webpage from which information is to be extracted is obtained, the following step is added: partitioning the page into blocks. Selecting a preset template according to the URL of the webpage is refined as: selecting a preset template according to the URL of the webpage from which information is to be extracted and the block information. Extracting the webpage information using the selected preset template specifically comprises: extracting the webpage information using the preset template selected according to the URL of the webpage and the block information.

Referring to Fig. 5, the webpage information extraction method comprises:

Step S301, obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted.

Step S302, partitioning the page into blocks.

The page from which information is to be extracted is laid out so that its text, graphics and tables are arranged into multiple blocks, for example information blocks, image blocks and advertisement blocks. The webpage can be partitioned according to the specific content of each block; for a webpage with simple content, it can also be partitioned by setting region ranges.

Step S303, selecting a preset template according to the URL of the webpage from which information is to be extracted and the block information.

For a partitioned page, a suitable preset template can be selected from a template database according to the URL of the webpage and the position of the block on the page.
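The lookup in step S303 can be pictured as a table keyed by both the page URL and the block position. In the sketch below, the in-memory dictionary `TEMPLATE_DB`, the use of a block index as the "position", and both function names are assumptions made for illustration; the patent does not describe the structure of the template database.

```python
from urllib.parse import urlsplit

# Hypothetical template database keyed by (host, index of the block on the page).
TEMPLATE_DB = {
    ("finance.example.test", 0): "headline_block_template",
    ("finance.example.test", 1): "quote_table_template",
}


def select_block_template(url, block_index):
    """Look up the preset template for one block of the page at `url`."""
    host = urlsplit(url).hostname
    return TEMPLATE_DB.get((host, block_index))


def select_for_page(url, blocks):
    """Pair each block with its template; blocks without a template are skipped."""
    selected = []
    for index, block in enumerate(blocks):
        template = select_block_template(url, index)
        if template is not None:
            selected.append((block, template))
    return selected
```

Blocks with no registered template are simply left out, so only blocks that are known to carry useful information reach the extraction step.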

Step S304, extracting the webpage information using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.

The information in each block of the webpage is extracted using the template selected in step S303.

In this embodiment of the present invention, after the URL of the webpage from which information is to be extracted is obtained, the page is partitioned into blocks; a preset template is selected according to the URL of the webpage and the block information; and the webpage information is extracted using the preset template so selected. Partitioning the webpage into blocks and choosing a suitable template according to the block information and the webpage URL speeds up extraction and further improves the accuracy of information extraction.

Embodiment four

Fig. 6 shows the fourth embodiment of the present invention.

Fig. 6 is a flowchart of the webpage information extraction method provided by the fourth embodiment of the present invention. The method is based on the third embodiment. Further, partitioning the page into blocks is refined as: traversing all tags of the page and determining the block regions formed by consecutive tags.

Referring to Fig. 6, the webpage information extraction method comprises:

Step S401, obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted.

Step S402, traversing all separator tags of the page.

In the page from which information is to be extracted, different content can be marked with corresponding tags, for example in the HyperText Markup Language (HTML) text of the page. Tags are used to describe information blocks, for example <beginTag></beginTag>, <endTag></endTag> and <divideTag></divideTag>. Here <beginTag></beginTag> and <endTag></endTag> indicate the start and end positions of an information block, and the information block can be located in the HTML source file of the page according to them. <divideTag></divideTag> indicates the marker that acts as a separator within an information block. All tags of the page can be traversed from the HTML text file of the page from which information is to be extracted.

Step S403, determining the block regions formed by consecutive tags.

From the result of traversing all tags of the page in step S402, consecutive tags can be found, for example <beginTag></beginTag> and <endTag></endTag>; the content enclosed by these tags is the information of that block. An information block consists internally of multiple parts that share the same format. <divideTag></divideTag> indicates the marker that acts as a separator within the information block, that is, it is used to distinguish the individual information sub-blocks within the block.
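The role of the three tags can be sketched as follows, under the assumption that beginTag, endTag and divideTag each hold a literal marker string to be searched for in the HTML source. That reading, the function `split_block` and the sample page are hypothetical; the patent does not give the exact matching rule.

```python
def split_block(html, begin_tag, end_tag, divide_tag):
    """Return the sub-blocks found between begin_tag and end_tag, split on divide_tag."""
    start = html.find(begin_tag)
    if start == -1:
        return []
    start += len(begin_tag)
    end = html.find(end_tag, start)
    if end == -1:
        return []
    block = html[start:end]
    # Empty pieces (e.g. before the first separator) are dropped.
    return [part.strip() for part in block.split(divide_tag) if part.strip()]


page = '<ul id="quotes"><li>AAA 10.5</li><li>BBB 7.2</li></ul><p>footer</p>'
print(split_block(page, '<ul id="quotes">', "</ul>", "<li>"))
```

Each returned sub-block still contains its inner markup; a regular-expression template can then be applied to every sub-block in turn.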

Step S404, selecting a preset template according to the URL of the webpage from which information is to be extracted and the block information.

Step S405, extracting the webpage information using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.

In this embodiment of the present invention, partitioning the page into blocks is refined as traversing all tags of the page and determining the block regions formed by consecutive tags. The page can thus be partitioned accurately according to its content, which further improves the accuracy of information extraction.

Embodiment five

Fig. 7 is a flowchart of the webpage information extraction method provided by the fifth embodiment of the present invention. The method is based on the fourth embodiment. Further, determining the block regions formed by consecutive tags is refined as: calculating, according to the set separator tag weights, the weights of the block regions formed between separator tags; and determining the block regions formed between separator tags whose weights are greater than a preset value.

Referring to Fig. 7, the webpage information extraction method comprises:

Step S501, obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted.

Step S502, traversing all tags of the page.

Step S503, calculating, according to the set separator tag weights, the weights of the blocks formed between separator tags.

The webpage blocks delimited by separator tags differ greatly: some blocks contain a great deal of information, while others contain only a few words. Link blocks in particular clearly do not need to be extracted. Under the preceding method, these link blocks would also be extracted by template, which wastes considerable resources. The blocks formed between separator tags therefore need to be assessed to decide whether they should be extracted by template.

In this embodiment, the regions formed between separator tags are evaluated against a preset block threshold. The following procedure can be used:

n := 0; k := 0; TagSeg := Φ;

While Not end of Doc

    k := k + 1

    : the k-th HTML tag extracted from Doc

    If Blank ()    // consecutive HTML tags

        If ∈ S    // consecutive separator tags

            → TagSeg

        End If

    End

    If    // inside a separator tag segment

        Calculate the segmentation weight corresponding to the separator tag segment,

    EndElse

EndWhile

Step S504, determining the block regions formed between separator tags whose weights are greater than the preset value.

According to the calculation result of step S503, the block regions that meet the set threshold can be placed in one set; the block regions in this set are the block regions formed between separator tags whose weights are greater than the preset value. The code is as follows:

If S_ws >= S'    // the separator tag segment forms an interval

    <B_n, TagSeg_n> → Q

EndIf

// clear the separator tag set
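The pseudocode of steps S503 and S504 can be approximated in runnable form as follows. This is a sketch under stated assumptions, not the patent's algorithm: the weight values in `SEPARATOR_WEIGHTS`, the `THRESHOLD`, and the rule that the weight of a run of consecutive separator tags is the sum of the individual tag weights are all hypothetical, since the source gives neither values nor a formula.

```python
import re

# Hypothetical separator tag weights and threshold; the source gives no values.
SEPARATOR_WEIGHTS = {"div": 3, "table": 3, "tr": 2, "p": 1, "br": 1}
THRESHOLD = 3

# A token is either a tag (group 1 = tag name) or a run of text (group 2).
TOKEN = re.compile(r"<\s*/?\s*([a-zA-Z0-9]+)[^>]*>|([^<]+)")


def split_by_weight(html, weights=SEPARATOR_WEIGHTS, threshold=THRESHOLD):
    """Split text into blocks at runs of separator tags whose summed weight >= threshold."""
    blocks, current, run_weight = [], [], 0
    for match in TOKEN.finditer(html):
        tag, text = match.group(1), match.group(2)
        if tag is not None:
            # Consecutive separator tags accumulate into one run (TagSeg).
            run_weight += weights.get(tag.lower(), 0)
            continue
        if not text.strip():
            continue  # whitespace does not interrupt a run of tags
        if run_weight >= threshold and current:
            blocks.append(" ".join(current))  # the run forms a block boundary
            current = []
        run_weight = 0  # clear the separator tag run
        current.append(text.strip())
    if current:
        blocks.append(" ".join(current))
    return blocks


page = "<div><p>Revenue</p><p>120</p></div><div><p>Profit</p><br>15</div>"
print(split_by_weight(page))
```

With these assumed weights, a lone `<p>` or `<br>` is too light to start a new block, while a `</div><div>` sequence is heavy enough to do so. Raising the threshold merges more text into each block.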

Step S505, selecting a preset template according to the URL of the webpage from which information is to be extracted and the block information.

Step S506, extracting the webpage information using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.

In this embodiment of the present invention, determining the block regions formed by consecutive tags is refined as: calculating, according to the set separator tag weights, the weights of the block regions formed between separator tags; and determining the block regions formed between separator tags whose weights are greater than the preset value. The block regions of the page can thus be assessed and those that need not be extracted can be discarded. This reduces the work of selecting templates and applying them to block regions, lowers the workload of information extraction, speeds up extraction, and also improves its accuracy.

Using the webpage information extraction method provided by this embodiment, company financial statement information was extracted from three major websites, Sina, Sohu and Tencent. The results are as follows:

Embodiment six

Fig. 8 shows the sixth embodiment of the present invention.

Fig. 8 is a structural diagram of the webpage information extraction device provided by the sixth embodiment of the present invention.

As shown in Fig. 8, the webpage information extraction device comprises a URL acquiring unit 610, a template selection unit 620 and a webpage information extraction unit 630.

The URL acquiring unit is configured to obtain the uniform resource locator (URL) of the webpage from which information is to be extracted;

Further, the URL acquiring unit is specifically configured to:

obtain the URL of the webpage from which information is to be extracted and the URLs contained in that webpage;

The template selection unit is specifically configured to:

select a preset template according to the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.

Further, the webpage information extraction device also comprises a blocking unit 640.

The blocking unit is configured to partition the page into blocks;

The template selection unit is specifically configured to:

select a preset template according to the URL of the webpage from which information is to be extracted and the block information;

The webpage information extraction unit is specifically configured to:

extract the webpage information using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.

Further, the blocking unit also comprises a traversal unit 641 and a block region determining unit 642.

The traversal unit is configured to traverse all separator tags of the page;

The block region determining unit is configured to determine the block regions formed by consecutive separator tags.

Further, the block region determining unit comprises a weight calculation unit 6421 and a second region determining unit 6422.

The weight calculation unit is configured to calculate, according to the set separator tag weights, the weights of the regions formed between separator tags;

The second region determining unit is configured to determine the block regions formed between separator tags whose weights are greater than the preset value.

The webpage information extraction device described above can perform the webpage information extraction method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to that method.
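The cooperation of the three basic units can be sketched as a small Python class in which each unit is modeled as a plain callable. The class `ExtractionDevice`, its field names, the `fetch` page loader and the sample data are hypothetical and serve only to illustrate how the units hand data to one another; the blocking unit is omitted for brevity.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Template = Dict[str, re.Pattern]


@dataclass
class ExtractionDevice:
    """Wires the units together; each unit is modeled as a plain callable."""

    acquire_urls: Callable[[], List[str]]                 # URL acquiring unit
    select_template: Callable[[str], Optional[Template]]  # template selection unit
    fetch: Callable[[str], str]                           # hypothetical page loader

    def extract(self, html: str, template: Template) -> Dict[str, Optional[str]]:
        # Webpage information extraction unit: first match of each pattern.
        result = {}
        for name, pattern in template.items():
            match = pattern.search(html)
            result[name] = match.group(1) if match else None
        return result

    def run(self) -> Dict[str, Dict[str, Optional[str]]]:
        records = {}
        for url in self.acquire_urls():
            template = self.select_template(url)
            if template is not None:  # pages without a template are skipped
                records[url] = self.extract(self.fetch(url), template)
        return records


pages = {"http://a.test/": "<title>Hello</title>", "http://b.test/": "<p>x</p>"}
templates = {"http://a.test/": {"title": re.compile(r"<title>(.*?)</title>")}}
device = ExtractionDevice(lambda: list(pages), templates.get, pages.__getitem__)
print(device.run())
```

Passing the units in as callables keeps the sketch independent of any particular crawler or template store.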

The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

Those of ordinary skill in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into individual integrated circuit modules; or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The embodiments in this specification are described progressively; each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims

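1. A webpage information extraction method, comprising:

obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted;

selecting a preset template according to the URL of the webpage from which information is to be extracted;

extracting the webpage information using the selected preset template.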

2. The webpage information extraction method according to claim 1, characterized in that obtaining the URL of the webpage from which information is to be extracted specifically comprises:

obtaining the URL of the webpage from which information is to be extracted and the URLs contained in that webpage;

and selecting a preset template according to the URL of the webpage from which information is to be extracted specifically comprises:

selecting a preset template according to the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.

3. The webpage information extraction method according to claim 1, characterized in that, after obtaining the URL of the webpage from which information is to be extracted, the method further comprises:

partitioning the page into blocks;

selecting a preset template according to the URL of the webpage from which information is to be extracted specifically comprises: selecting a preset template according to the URL of the webpage from which information is to be extracted and the block information;

extracting the webpage information using the selected preset template specifically comprises:

extracting the webpage information using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.

4. The webpage information extraction method according to claim 3, characterized in that partitioning the page into blocks specifically comprises:

traversing all separator tags of the page;

determining the block regions formed by consecutive separator tags.

5. The webpage information extraction method according to claim 4, characterized in that determining the block regions formed by consecutive tags specifically comprises:

calculating, according to the set separator tag weights, the weights of the blocks formed between separator tags;

determining the block regions formed between separator tags whose weights are greater than a preset value.

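6. A webpage information extraction device, comprising:

a URL acquiring unit, configured to obtain the uniform resource locator (URL) of the webpage from which information is to be extracted;

a template selection unit, configured to select a preset template according to the URL of the webpage from which information is to be extracted;

a webpage information extraction unit, configured to extract the webpage information using the selected preset template.

7. The webpage information extraction device according to claim 6, characterized in that the URL acquiring unit is specifically configured to:

obtain the URL of the webpage from which information is to be extracted and the URLs contained in that webpage;

The template selection unit is specifically configured to:

select a preset template according to the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.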

8. The webpage information extraction device according to claim 6, characterized in that the webpage information extraction device further comprises:

a blocking unit, configured to partition the page into blocks;

The template selection unit is specifically configured to:

select a preset template according to the URL of the webpage from which information is to be extracted and the block information;

The webpage information extraction unit is specifically configured to:

extract the webpage information using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.

9. The webpage information extraction device according to claim 8, characterized in that the blocking unit further comprises:

a traversal unit, configured to traverse all separator tags of the page;

a block region determining unit, configured to determine the block regions formed by consecutive separator tags.

10. The webpage information extraction device according to claim 9, characterized in that the block region determining unit comprises:

a weight calculation unit, configured to calculate, according to the set separator tag weights, the weights of the regions formed between separator tags;

a second region determining unit, configured to determine the block regions formed between separator tags whose weights are greater than the preset value.