CN104572874A - Webpage information extraction method and device - Google Patents

Webpage information extraction method and device

Info

Publication number
CN104572874A
Authority
CN
China
Prior art keywords
url
extracting information
webpage
info web
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410804430.9A
Other languages
Chinese (zh)
Other versions
CN104572874B (en)
Inventor
刘雄伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ruian Technology Co Ltd
Original Assignee
Beijing Ruian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ruian Technology Co Ltd filed Critical Beijing Ruian Technology Co Ltd
Priority to CN201410804430.9A priority Critical patent/CN104572874B/en
Publication of CN104572874A publication Critical patent/CN104572874A/en
Application granted granted Critical
Publication of CN104572874B publication Critical patent/CN104572874B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The embodiment of the invention discloses a webpage information extraction method and device. The webpage information extraction method comprises the following steps: acquiring the uniform resource locator (URL) of a webpage from which information is to be extracted; selecting a preset template according to the URL of the webpage from which the information is to be extracted; extracting the webpage information by using the selected preset template. Therefore, the webpage information extraction accuracy is increased.

Description

A method and device for extracting webpage information
Technical field
The present invention relates to the field of information technology, and in particular to a method and device for extracting webpage information.
Background
With the rapid development of the Internet, online media, as a new form of information dissemination, has become a deep part of daily life. Text information extraction is an accurate and efficient way of obtaining information: it extracts the information a user needs, such as specified entities, relations, and events, from one or more webpages, organizes it into structured data, and presents it to the user. The approach offers accurate content, low redundancy, and a well-organized structure.
In the prior art, several techniques can be used to extract information from multi-record webpages. A traditional approach, for example, is to write extraction rules by hand, which can extract records quickly and accurately from a specific data source. However, as the volume of online information grows and webpage content is continually updated, relying only on a single manually configured template to extract information from massive, ever-changing data inevitably reduces extraction accuracy. Even when extraction is limited to website pages in the same domain, the pages are numerous and their layout styles are varied and changeable, so existing techniques still cannot effectively improve extraction accuracy.
Summary of the invention
In view of this, embodiments of the present invention provide a method and device for extracting webpage information, so as to improve the accuracy of webpage information extraction.
In a first aspect, an embodiment of the present invention provides a method for extracting webpage information, the method comprising:
Obtaining the uniform resource locator (URL) of the webpage from which information is to be extracted;
Selecting a preset template according to the URL of the webpage from which information is to be extracted;
Extracting the webpage information by using the selected preset template.
In a second aspect, an embodiment of the present invention provides a device for extracting webpage information, the device comprising:
A URL acquiring unit, configured to obtain the uniform resource locator (URL) of the webpage from which information is to be extracted;
A template selection unit, configured to select a preset template according to the URL of the webpage from which information is to be extracted;
A webpage information extraction unit, configured to extract the webpage information by using the selected preset template.
The method and device provided by the embodiments of the present invention obtain the URL of the webpage from which information is to be extracted, select a preset template according to that URL, and extract the webpage information by using the selected preset template, thereby improving the accuracy of webpage information extraction.
Brief description of the drawings
Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a flowchart of the webpage information extraction method provided by the first embodiment of the present invention;
Fig. 2 is a schematic diagram of the webpage information extraction method provided by the first embodiment of the present invention;
Fig. 3 is a flowchart of the webpage information extraction method provided by the second embodiment of the present invention;
Fig. 4 is a schematic diagram of the webpage information extraction method provided by the second embodiment of the present invention;
Fig. 5 is a flowchart of the webpage information extraction method provided by the third embodiment of the present invention;
Fig. 6 is a flowchart of the webpage information extraction method provided by the fourth embodiment of the present invention;
Fig. 7 is a flowchart of the webpage information extraction method provided by the fifth embodiment of the present invention;
Fig. 8 is a structural diagram of the webpage information extraction device provided by the sixth embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the present invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire content.
Fig. 1 and Fig. 2 show the first embodiment of the present invention.
Fig. 1 is a flowchart of the webpage information extraction method provided by the first embodiment of the present invention, and Fig. 2 is a schematic diagram of that method. The method comprises:
Step S101: obtain the uniform resource locator (URL) of the webpage from which information is to be extracted.
A uniform resource locator (Uniform Resource Locator, URL) is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. Each file on the Internet has a unique URL, which contains information indicating where the file is located and how a browser should handle it.
A URL also serves as an address on the World Wide Web: every webpage that can be accessed on the Internet has a URL. Therefore, the URL of the webpage from which information is to be extracted should be obtained first. For example, to extract information from the NetEase home page, the URL of the NetEase home page (i.e. http://www.163.com/) must first be obtained.
Step S102: select a preset template according to the URL of the webpage from which information is to be extracted.
Different templates can be preset for different websites, because the information that different websites display differs greatly. Take Sina and Taobao as an example: Sina, as a comprehensive portal, mainly displays news, whereas Taobao mainly displays commodities. The extraction templates suited to these two websites necessarily differ considerably. If the same extraction template were used for both, accuracy would inevitably drop, because the regular expressions in an extraction template only work on strings with the corresponding format. Therefore, the corresponding preset template can be selected according to the obtained URL of the webpage from which information is to be extracted, thereby improving the accuracy of webpage information extraction.
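The selection in step S102 can be sketched as a lookup keyed on the host part of the URL. This is only a minimal illustration: the registry `TEMPLATES_BY_HOST`, the template names, and the function `select_template` are hypothetical and do not come from the patent, which does not specify how templates are stored or matched.

```python
from urllib.parse import urlparse

# Hypothetical registry: host suffix -> name of a preset template.
TEMPLATES_BY_HOST = {
    "sina.com.cn": "news_template",
    "taobao.com": "commodity_template",
}


def select_template(url, registry=TEMPLATES_BY_HOST, default=None):
    """Return the template registered for the URL's host, or `default`."""
    host = (urlparse(url).hostname or "").lower()
    for suffix, template in registry.items():
        # Match the registered host itself and any of its subdomains.
        if host == suffix or host.endswith("." + suffix):
            return template
    return default
```

Matching on a host suffix lets one template cover all subdomains of a site; a real system might instead match on URL path patterns, which the source leaves open.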
Step S103: extract the webpage information by using the selected preset template.
The webpage information is extracted using the preset template selected in step S102; the template may be a group of regular expressions. A regular expression is a logical formula for operating on strings: certain predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings.
Given a regular expression and another string, the following can be achieved: determining whether the string satisfies the filtering logic of the regular expression (called a "match"), and extracting the specific parts of interest from the string.
With the configured regular expressions, relevant content can be identified and extracted from the webpage content while irrelevant content is removed, and the extracted information can be stored in a specified database for convenient querying and viewing.
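A template made of a group of regular expressions can be sketched as a mapping from field name to pattern. The field names, the patterns, and the function `extract` below are assumptions chosen for illustration; the patent does not give concrete expressions.

```python
import re

# Hypothetical template: field name -> regular expression with one capture group.
NEWS_TEMPLATE = {
    "title": re.compile(r"<h1[^>]*>(.*?)</h1>", re.S),
    "date": re.compile(r'<span class="date">(.*?)</span>', re.S),
}


def extract(html, template):
    """Apply every pattern of the template and collect the first match per field."""
    record = {}
    for field, pattern in template.items():
        match = pattern.search(html)
        # A field whose pattern does not match is recorded as None.
        record[field] = match.group(1).strip() if match else None
    return record
```

Content that no pattern captures (for example an advertisement element) is simply left out of the resulting record, which corresponds to removing irrelevant content.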
This embodiment of the present invention obtains the uniform resource locator (URL) of the webpage from which information is to be extracted, selects a preset template according to that URL, and extracts the webpage information by using the selected preset template, thereby improving extraction accuracy.
Embodiment two
Fig. 3 and Fig. 4 show the second embodiment of the present invention.
Fig. 3 is a flowchart of the webpage information extraction method provided by the second embodiment of the present invention, and Fig. 4 is a schematic diagram of that method. The method is based on the first embodiment. Further, obtaining the URL of the webpage from which information is to be extracted is refined into: obtaining the URL of the webpage from which information is to be extracted and the URLs contained in that webpage; and selecting a preset template according to the URL of the webpage is refined into: selecting a preset template according to the URL of the webpage and the URLs contained in that webpage.
Referring to Fig. 3 and Fig. 4, the method comprises:
Step S201: obtain the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.
The webpage from which information is to be extracted may contain multiple links. For example, the webpage may be the home page of a portal website. The NetEase home page, for instance, contains links to a number of sub-sections such as forums, news, and finance. These links, and the content of the pages they point to, can be obtained with a web crawler. A web crawler is a program that fetches webpages automatically: starting from the URLs of one or more initial pages, it obtains the URLs on those pages and, while crawling, continually extracts new URLs from the current page and puts them into a queue.
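The queue-based crawling described above can be sketched with the Python standard library. The names `LinkCollector` and `crawl`, the `max_pages` limit, and the injected `fetch` callable (used so that the sketch runs without network access) are illustrative assumptions, not part of the patent.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the absolute URLs of all <a href="..."> links on one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def crawl(start_url, fetch, max_pages=10):
    """Breadth-first crawl; `fetch(url)` returns HTML text or None."""
    queue = deque([start_url])
    seen = {start_url}
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        collector = LinkCollector(url)
        collector.feed(html)
        # Newly discovered URLs go to the back of the queue.
        for link in collector.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

In practice `fetch` would perform an HTTP request; here any callable works, for example the `get` method of a dictionary that maps URLs to HTML.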
Step S202: select a preset template according to the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.
The webpage from which information is to be extracted may contain multiple links. The home page of a portal website, for example, contains links to a number of sub-sections such as forums, news, and finance. Because the content of these sub-sections differs greatly, the corresponding preset template needs to be chosen according to the URL of each sub-section. A template can consist of a group of regular expressions.
Step S203: extract the webpage information by using the selected preset template.
In this embodiment of the present invention, obtaining the URL of the webpage from which information is to be extracted is refined into obtaining the URL of that webpage and the URLs contained in it, and selecting a preset template according to the URL of the webpage is refined into selecting a preset template according to the URL of the webpage and the URLs contained in it. A web crawler can be used to obtain the URLs contained in the webpage and the content of the pages they point to, and a suitable template is selected for each contained URL to extract the webpage information. In this way, information can be extracted from multiple webpages automatically and quickly while accuracy is maintained.
Embodiment three
Fig. 5 shows the third embodiment of the present invention.
Fig. 5 is a flowchart of the webpage information extraction method provided by the third embodiment of the present invention. The method is based on the first embodiment. Further, after the URL of the webpage from which information is to be extracted is obtained, the following step is added: partitioning the page into blocks. Selecting a preset template according to the URL of the webpage is refined into: selecting a preset template according to the URL of the webpage and the block information. Extracting the webpage information by using the selected preset template specifically comprises: extracting the webpage information by using the preset template selected according to the URL of the webpage and the block information.
Referring to Fig. 5, the method comprises:
Step S301: obtain the uniform resource locator (URL) of the webpage from which information is to be extracted.
Step S302: partition the page into blocks.
Through layout, the text, graphics, and tables of the page are arranged so that the page contains multiple blocks, such as information blocks, image blocks, and advertisement blocks. The webpage can be partitioned according to the specific content of each block; a webpage with simple content can also be partitioned by setting region ranges.
Step S303: select a preset template according to the URL of the webpage from which information is to be extracted and the block information.
For a partitioned page, a suitable preset template can be selected from the template database according to the URL of the webpage and the position of the block within the page.
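Selecting a template from both the URL and the block position can be sketched as a lookup on a composite key. The in-memory `TEMPLATE_DB`, the position labels, and the template names are hypothetical; the patent only states that a template database is queried with these two pieces of information.

```python
from urllib.parse import urlparse

# Hypothetical template database: (host, block position) -> template name.
TEMPLATE_DB = {
    ("www.example.com", "header"): "navigation_template",
    ("www.example.com", "main"): "article_template",
}


def select_block_template(url, block_position, db=TEMPLATE_DB):
    """Look up the preset template for one block of the page at `url`."""
    host = (urlparse(url).hostname or "").lower()
    # Returns None when no template is registered for this host and position.
    return db.get((host, block_position))
```

For example, `select_block_template("http://www.example.com/a.html", "main")` yields `"article_template"` under the sample database above.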
Step S304: extract the webpage information by using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.
The template selected in step S303 is used to extract the information in each block of the webpage.
In this embodiment of the present invention, after the URL of the webpage from which information is to be extracted is obtained, the page is partitioned into blocks; a preset template is selected according to the URL of the webpage and the block information; and the webpage information is extracted by using the template so selected. Partitioning the webpage and choosing a suitable template according to the block information and the webpage URL speeds up extraction and further improves extraction accuracy.
Embodiment four
Fig. 6 shows the fourth embodiment of the present invention.
Fig. 6 is a flowchart of the webpage information extraction method provided by the fourth embodiment of the present invention. The method is based on the third embodiment. Further, partitioning the page into blocks is refined into: traversing all tags of the page and determining the block regions formed by consecutive tags.
Referring to Fig. 6, the method comprises:
Step S401: obtain the uniform resource locator (URL) of the webpage from which information is to be extracted.
Step S402: traverse all separator tags of the page.
In the page from which information is to be extracted, different content can be marked with corresponding tags, for example in the page's HyperText Markup Language (HTML) text. The text uses tags to describe information blocks, for example <beginTag></beginTag>, <endTag></endTag>, and <divideTag></divideTag>. Here <beginTag></beginTag> and <endTag></endTag> indicate where an information block starts and ends, so that the information block can be located in the HTML page source file. <divideTag></divideTag> marks the divisions inside an information block. All tags of the page can be traversed from the HTML text file of the page from which information is to be extracted.
Step S403: determine the block regions formed by consecutive tags.
From the result of traversing all tags of the page in step S402, consecutive tags can be found, for example <beginTag></beginTag> and <endTag></endTag>; the content enclosed by these tags is the information of that block. An information block consists internally of multiple parts whose content has the same format. <divideTag></divideTag> marks the divisions inside an information block, that is, it distinguishes the individual information sub-blocks within the block.
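Locating an information block and splitting it into sub-blocks can be sketched as follows. The sketch assumes that the begin, end, and divide tags each hold a literal marker string taken from a template configuration; the source does not spell this out, so the function `split_block` and its parameters are hypothetical.

```python
def split_block(html, begin_marker, end_marker, divide_marker):
    """Return the sub-blocks found between begin_marker and end_marker.

    The three markers are assumed to be literal strings from a template
    configuration (an interpretation of beginTag, endTag, and divideTag).
    """
    start = html.find(begin_marker)
    if start == -1:
        return []
    start += len(begin_marker)
    end = html.find(end_marker, start)
    if end == -1:
        return []
    block = html[start:end]
    # Each non-empty piece between divide markers is one information sub-block.
    return [part.strip() for part in block.split(divide_marker) if part.strip()]
```

Plain string search keeps the sketch short; a production system would more likely walk a parsed DOM tree so that nested tags are handled correctly.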
Step S404: select a preset template according to the URL of the webpage from which information is to be extracted and the block information.
Step S405: extract the webpage information by using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.
In this embodiment of the present invention, partitioning the page into blocks is refined into traversing all tags of the page and determining the block regions formed by consecutive tags. The page can thus be partitioned accurately according to its content, which further improves extraction accuracy.
Embodiment five
Fig. 7 is a flowchart of the webpage information extraction method provided by the fifth embodiment of the present invention. The method is based on the fourth embodiment. Further, determining the block regions formed by consecutive tags is refined into: calculating, according to the set separator tag weights, the weights of the block regions formed between separator tags; and determining the block regions formed between separator tags whose weights are greater than a preset value.
Referring to Fig. 7, the method comprises:
Step S501: obtain the uniform resource locator (URL) of the webpage from which information is to be extracted.
Step S502: traverse all tags of the page.
Step S503: calculate, according to the set separator tag weights, the weights of the blocks formed between separator tags.
The blocks delimited by separator tags differ greatly: some contain a large amount of information, while others contain only a few words. This is especially true of link blocks, which clearly do not need to be extracted. Under the original method, these link blocks would also be extracted with a template, wasting considerable resources. The blocks formed between separator tags therefore need to be evaluated to decide whether they should be extracted with a template.
In this embodiment, the blocks formed between separator tags are judged against a preset block threshold. This can be implemented with the following procedure:
n := 0; k := 0; TagSeg := Φ;
While Not end of file of Doc
k := k + 1
: the k-th HTML tag extracted from Doc
If Blank () // consecutive HTML tags are present
If ∈ S // consecutive separator tags are present
→ TagSeg
End If
End
If // inside a separator tag segment
Calculate the segmentation weight corresponding to the separator tag segment,
EndElse
EndWhile
Step S504: determine the block regions formed between separator tags whose weights are greater than the preset value.
Based on the calculation result of step S503, the block regions that satisfy the set threshold are put into one set; the block regions in this set are those formed between separator tags whose weights are greater than the preset value. The code is as follows:
If S_ws >= S' // the separator tag segment forms an interval
<B_n, TagSeg_n> → Q
EndIf
EndIf
// empty the separator tag set
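The weighting idea can be illustrated with a small sketch. It is one possible reading of the pseudocode, not the patent's implementation: runs of consecutive tags accumulate a segment weight from per-tag weights, and a run whose weight reaches a threshold closes the current block. The weight table, the threshold value, and the function `partition` are all hypothetical, since the source gives no concrete values or formula.

```python
import re

# Hypothetical separator-tag weights and threshold; the source gives no values.
SEPARATOR_WEIGHTS = {"div": 3, "table": 3, "tr": 2, "td": 1, "p": 1, "br": 1}
THRESHOLD = 3

# One token is either a tag (group 1 holds its name) or a run of text (group 2).
TOKEN = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>|([^<]+)")


def partition(html, weights=SEPARATOR_WEIGHTS, threshold=THRESHOLD):
    """Split page text into blocks at sufficiently heavy runs of separator tags."""
    blocks = []
    current = []
    segment_weight = 0  # weight of the current run of consecutive tags
    for match in TOKEN.finditer(html):
        tag, text = match.group(1), match.group(2)
        if tag is not None:
            # Tags that are not separator tags contribute weight 0.
            segment_weight += weights.get(tag.lower(), 0)
            continue
        text = text.strip()
        if not text:
            continue  # whitespace between tags does not end the run
        if segment_weight >= threshold and current:
            blocks.append(" ".join(current))
            current = []
        current.append(text)
        segment_weight = 0
    if current:
        blocks.append(" ".join(current))
    return blocks
```

With the sample weights, two paragraphs inside one `div` stay in the same block, while text in a following `div` starts a new block. A later filtering step could then discard blocks that contain only links, as the description suggests.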
Step S505: select a preset template according to the URL of the webpage from which information is to be extracted and the block information.
Step S506: extract the webpage information by using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.
In this embodiment of the present invention, determining the block regions formed by consecutive tags is refined into: calculating, according to the set separator tag weights, the weights of the block regions formed between separator tags; and determining the block regions formed between separator tags whose weights are greater than the preset value. The block regions of the page can thus be evaluated and those that need no extraction removed, which avoids selecting and applying templates to such regions. This reduces the extraction workload, speeds up extraction, and also improves extraction accuracy.
Using the webpage information extraction method provided by this embodiment, company financial statement information was extracted from three major websites: Sina, Sohu, and Tencent. The results are as follows:
Embodiment six
Fig. 8 shows the sixth embodiment of the present invention.
Fig. 8 is a structural diagram of the webpage information extraction device provided by the sixth embodiment of the present invention.
As shown in Fig. 8, the webpage information extraction device comprises a URL acquiring unit 610, a template selection unit 620, and a webpage information extraction unit 630.
The URL acquiring unit is configured to obtain the uniform resource locator (URL) of the webpage from which information is to be extracted;
The template selection unit is configured to select a preset template according to the URL of the webpage from which information is to be extracted;
The webpage information extraction unit is configured to extract the webpage information by using the selected preset template.
Further, the URL acquiring unit is specifically configured to:
Obtain the URL of the webpage from which information is to be extracted and the URLs contained in that webpage;
The template selection unit is specifically configured to:
Select a preset template according to the URL of the webpage from which information is to be extracted and the URLs contained in that webpage.
Further, the webpage information extraction device also comprises a blocking unit 640.
The blocking unit is configured to partition the page into blocks;
The template selection unit is specifically configured to:
Select a preset template according to the URL of the webpage from which information is to be extracted and the block information;
The webpage information extraction unit is specifically configured to:
Extract the webpage information by using the preset template selected according to the URL of the webpage from which information is to be extracted and the block information.
Further, the blocking unit also comprises a traversal unit 641 and a block region determining unit 642.
The traversal unit is configured to traverse all separator tags of the page;
The block region determining unit is configured to determine the block regions formed by consecutive separator tags.
Further, the block region determining unit comprises a weight calculation unit 6421 and a second region determining unit 6422.
The weight calculation unit is configured to calculate, according to the set separator tag weights, the weights of the regions formed between separator tags;
The second region determining unit is configured to determine the block regions formed between separator tags whose weights are greater than a preset value.
The webpage information extraction device described above can perform the webpage information extraction method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to that method.
The sequence numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
Those of ordinary skill in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. an abstracting method for info web, comprising:
Obtain the uniform resource locator URL for Extracting Information webpage;
The template preset is selected according to the URL for Extracting Information webpage;
The template preset selected is used to extract info web.
2. the abstracting method of info web according to claim 1, is characterized in that, described acquisition specifically comprises for the uniform resource locator URL of Extracting Information webpage:
Obtain for the URL of Extracting Information webpage and for the URL included by Extracting Information webpage;
The described template selecting to preset according to the URL for Extracting Information webpage specifically for:
According to the URL for Extracting Information webpage and for the URL included by Extracting Information webpage, select the template preset.
3. the abstracting method of info web according to claim 1, is characterized in that, obtains the URL for Extracting Information webpage, also comprises:
Piecemeal is carried out to the page;
Described selects the template preset specifically to comprise according to the URL for Extracting Information webpage: according to URL and the blocking information of wish Extracting Information webpage, select the template preset;
The template preset selected by described use extracts info web and specifically comprises:
Use the URL according to wish Extracting Information webpage and blocking information, the selected template preset, extracts info web.
4. the abstracting method of info web according to claim 3, is characterized in that, described carry out piecemeal to the page and specifically comprises:
The all separation labels of the traversal page;
Determine to separate continuously the segmented areas that label is formed.
5. the abstracting method of info web according to claim 4, is characterized in that, the piecemeal that the continuous label of described determination is formed specifically comprises:
According to the separation label weights of setting, calculate to separate between label form the weights of piecemeal;
Determine that weights are greater than the segmented areas formed between the separation label of preset value.
6. a draw-out device for info web, comprising:
URL acquiring unit, for obtaining the uniform resource locator URL for Extracting Information webpage;
Template selection unit, for selecting the template preset according to the URL for Extracting Information webpage;
Web page information extraction unit, extracts info web for using the selected template preset.
7. the draw-out device of info web according to claim 6, is characterized in that, described URL acquiring unit specifically for:
Obtain for the URL of Extracting Information webpage and for the URL included by Extracting Information webpage;
Described template selection unit specifically for:
According to the URL for Extracting Information webpage and for the URL included by Extracting Information webpage, select the template preset.
8. the draw-out device of info web according to claim 6, is characterized in that, the draw-out device of described info web also comprises:
Blocking unit, for carrying out piecemeal to the page;
Described template selection unit specifically for:
According to URL and the blocking information of wish Extracting Information webpage, select the template preset;
Described Web page information extraction unit specifically for:
Use according to presetting template selected by the URL of Extracting Information webpage and blocking information, info web is extracted.
9. the draw-out device of info web according to claim 8, is characterized in that, described blocking unit also comprises:
Traversal Unit, for traveling through all separation labels of the page;
Segmented areas determining unit, for determining to separate continuously the segmented areas that label is formed.
10. The webpage information extraction device according to claim 9, characterized in that the block region determining unit comprises:
A weight calculation unit, configured to calculate, according to preset separator tag weights, the weights of the regions formed between separator tags;
A second region determining unit, configured to determine the block regions, formed between separator tags, whose weights are greater than a preset value.
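The three units of claim 6 (URL acquisition, template selection, extraction) can be pictured with a short Python sketch. The claims do not define what a preset template contains, so the sketch assumes a template is a URL pattern plus a set of field-level regular expressions. The `Template` class, `PRESET_TEMPLATES`, `select_template`, `extract`, and the `example.com` URLs are hypothetical names chosen for illustration only.

```python
import re
from dataclasses import dataclass


@dataclass
class Template:
    url_pattern: str  # regex matched against the page URL (assumed selection key)
    fields: dict      # field name -> regex with one capture group (assumed form)


# Hypothetical preset templates; the example.com URLs are placeholders.
PRESET_TEMPLATES = [
    Template(r"^https?://news\.example\.com/article/\d+$",
             {"title": r"<h1>(.*?)</h1>",
              "author": r'<span class="author">(.*?)</span>'}),
    Template(r"^https?://forum\.example\.com/thread/\d+$",
             {"title": r"<h2>(.*?)</h2>"}),
]


def select_template(url, templates=PRESET_TEMPLATES):
    """Template selection unit: first preset template whose pattern matches the URL."""
    for template in templates:
        if re.search(template.url_pattern, url):
            return template
    return None


def extract(url, html):
    """Extraction unit: apply the selected template's field patterns to the page."""
    template = select_template(url)
    if template is None:
        return {}
    result = {}
    for name, pattern in template.fields.items():
        match = re.search(pattern, html, re.S)
        if match:
            result[name] = match.group(1).strip()
    return result


page = '<h1>Hello</h1><span class="author">Li</span>'
print(extract("http://news.example.com/article/42", page))
```

Keying the selection on the URL means no page content has to be inspected before a template is chosen. Claims 7 and 8 extend the selection key with the URLs contained in the page and with block information, which this sketch does not model.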
CN201410804430.9A 2014-12-19 2014-12-19 Webpage information extraction method and device Active CN104572874B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410804430.9A CN104572874B (en) 2014-12-19 2014-12-19 Webpage information extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410804430.9A CN104572874B (en) 2014-12-19 2014-12-19 Webpage information extraction method and device

Publications (2)

Publication Number Publication Date
CN104572874A CN104572874A (en) 2015-04-29
CN104572874B CN104572874B (en) 2019-03-05

Family

ID=53088936

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410804430.9A Active CN104572874B (en) 2014-12-19 2014-12-19 Webpage information extraction method and device

Country Status (1)

Country Link
CN (1) CN104572874B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160209A (en) * 2015-08-31 2015-12-16 佛山市恒南微科技有限公司 System for investigating and managing regional enterprise software copyright announcement
CN106815273A (en) * 2015-12-02 2017-06-09 北京国双科技有限公司 Data storage method and device
CN109933717A (en) * 2019-01-17 2019-06-25 华南理工大学 Academic conference recommendation system based on hybrid recommendation algorithm
CN110020236A (en) * 2017-08-29 2019-07-16 北京国双科技有限公司 Web analysis method, apparatus, storage medium, processor and equipment

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101192234A (en) * 2007-06-07 2008-06-04 腾讯科技(深圳)有限公司 Searching system and method based on web page extraction
CN101916285A (en) * 2010-08-20 2010-12-15 北京新岸线网络技术有限公司 Method and device for analyzing internet web page contents
CN102591971A (en) * 2011-12-31 2012-07-18 北京百度网讯科技有限公司 Method and device for extracting webpage information
CN102651002A (en) * 2011-02-28 2012-08-29 腾讯科技(深圳)有限公司 Webpage information extracting method and system

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101192234A (en) * 2007-06-07 2008-06-04 腾讯科技(深圳)有限公司 Searching system and method based on web page extraction
CN101916285A (en) * 2010-08-20 2010-12-15 北京新岸线网络技术有限公司 Method and device for analyzing internet web page contents
CN102651002A (en) * 2011-02-28 2012-08-29 腾讯科技(深圳)有限公司 Webpage information extracting method and system
CN102591971A (en) * 2011-12-31 2012-07-18 北京百度网讯科技有限公司 Method and device for extracting webpage information

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160209A (en) * 2015-08-31 2015-12-16 佛山市恒南微科技有限公司 System for investigating and managing regional enterprise software copyright announcement
CN106815273A (en) * 2015-12-02 2017-06-09 北京国双科技有限公司 Data storage method and device
CN106815273B (en) * 2015-12-02 2020-07-31 北京国双科技有限公司 Data storage method and device
CN110020236A (en) * 2017-08-29 2019-07-16 北京国双科技有限公司 Web analysis method, apparatus, storage medium, processor and equipment
CN109933717A (en) * 2019-01-17 2019-06-25 华南理工大学 Academic conference recommendation system based on hybrid recommendation algorithm
CN109933717B (en) * 2019-01-17 2021-05-14 华南理工大学 Academic conference recommendation system based on hybrid recommendation algorithm

Also Published As

Publication number Publication date
CN104572874B (en) 2019-03-05

Similar Documents

Publication Publication Date Title
CN101727461B (en) Method for extracting content of web page
CN102253979B (en) Vision-based web page extracting method
CN102662969B (en) Internet information object positioning method based on webpage structure semantic meaning
CN104598577B (en) Method for extracting webpage text
CN102270206A (en) Method and device for capturing valid web page contents
US8205153B2 (en) Information extraction combining spatial and textual layout cues
CN104331438B (en) Method and device for selectively extracting novel webpage content
CN104142985A (en) Semi-automatic vertical crawler generation tool and method
CN104572934A (en) Webpage key content extracting method based on DOM
CN104572874A (en) Webpage information extraction method and device
Uzun et al. An effective and efficient Web content extractor for optimizing the crawling process
CN107590288B (en) Method and device for extracting webpage image-text blocks
Nyein Mining contents in Web page using cosine similarity
Liu et al. Main content extraction from web pages based on node characteristics
CN107145591A (en) Title-based method for extracting effective content metadata from webpages
Yu et al. Web content information extraction based on DOM tree and statistical information
CN103942332A (en) Web page logic link block identification method
CN109740097A (en) Webpage text extraction method based on logical link blocks
CN109948015B (en) Meta search list result extraction method and system
Wang et al. A novel web page text information extraction method
CN106897287A (en) Method and device for extracting webpage publication time
CN103488743B (en) Page element extraction method and page element extraction system
CN115238078A (en) Webpage information extraction method, device, equipment and storage medium
KR101544142B1 (en) Searching method and system based on topic
JP5610215B2 (en) SEARCH DEVICE, SEARCH SYSTEM, SEARCH METHOD, AND SEARCH PROGRAM

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant