CN105468730A - Webpage information extraction method and equipment - Google Patents

Webpage information extraction method and equipment

Info

Publication number
CN105468730A
Authority
CN
China
Prior art keywords
source file
webpage source
expression formula
info web
information
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510815150.2A
Other languages
Chinese (zh)
Inventor
陈仕明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Huaduo Network Technology Co Ltd
Original Assignee
Guangzhou Huaduo Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Guangzhou Huaduo Network Technology Co Ltd
Priority to CN201510815150.2A
Publication of CN105468730A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the invention disclose a webpage information extraction method and equipment. The method comprises: obtaining a webpage source file corresponding to an input webpage address; obtaining, from the webpage source file, feature description information corresponding to the webpage information to be extracted; generating, according to the feature description information, an information extraction expression associated with the webpage information, wherein the information extraction expression is a jQuery expression; and loading the webpage source file corresponding to the webpage address with an embedded browser, and calling the information extraction expression to extract the webpage information after the webpage source file has finished loading. The method can reduce the complexity of the rules that define webpage information extraction and lower development cost.

Description

Webpage information extraction method and equipment
Technical field
The present invention relates to the field of Internet technology, and in particular to a webpage information extraction method and equipment.
Background
With the development of Internet technology, the Internet contains more and more webpage information, and many development projects need to extract part of that information from the webpages of external sites. The existing approach analyzes the retrieved HyperText Markup Language (HTML) content with regular expressions. Because regular expressions are structurally complex, this increases the complexity of the rules that define webpage information extraction and raises development cost.
Summary of the invention
The embodiments of the present invention provide a webpage information extraction method and equipment that can reduce the complexity of the rules that define webpage information extraction and lower development cost.
A first aspect of the embodiments of the present invention provides a webpage information extraction method, which may comprise:
obtaining the webpage source file corresponding to an input webpage address, and obtaining, from the webpage source file, the feature description information corresponding to the webpage information to be extracted;
generating, according to the feature description information, an information extraction expression associated with the webpage information, the information extraction expression being a jQuery expression;
loading the webpage source file corresponding to the webpage address with an embedded browser, and calling the information extraction expression to extract the webpage information after the webpage source file has finished loading.
A second aspect of the embodiments of the present invention provides webpage information extraction equipment, which may comprise:
an information acquisition unit, configured to obtain the webpage source file corresponding to an input webpage address, and to obtain, from the webpage source file, the feature description information corresponding to the webpage information to be extracted;
an expression generation unit, configured to generate, according to the feature description information, an information extraction expression associated with the webpage information, the information extraction expression being a jQuery expression;
an information extraction unit, configured to load the webpage source file corresponding to the webpage address with an embedded browser, and to call the information extraction expression to extract the webpage information after the webpage source file has finished loading.
In the embodiments of the present invention, the webpage source file corresponding to the input webpage address is obtained, the feature description information corresponding to the webpage information to be extracted is obtained from the webpage source file, a jQuery information extraction expression associated with the webpage information is generated from the feature description information, and finally an embedded browser loads the webpage source file corresponding to the webpage address and calls the information extraction expression to extract the webpage information once the webpage source file has finished loading. Because the embedded browser executes a jQuery information extraction expression to extract the webpage information, instead of a regular expression, the structure of the expression is simpler, which reduces the complexity of the rules that define webpage information extraction and lowers development cost.
Brief description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a webpage information extraction method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of another webpage information extraction method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of webpage information extraction equipment provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the information acquisition unit provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of another webpage information extraction equipment provided by an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the scope of protection of the present invention.
The webpage information extraction method provided by the embodiments of the present invention can be applied to scenarios in which webpage information is extracted from Internet webpages. For example: the webpage information extraction equipment obtains the webpage source file corresponding to an input webpage address and obtains, from the webpage source file, the feature description information corresponding to the webpage information to be extracted; the equipment generates, according to the feature description information, an information extraction expression associated with the webpage information, the information extraction expression being a jQuery expression; and the equipment loads the webpage source file corresponding to the webpage address with an embedded browser and calls the information extraction expression to extract the webpage information after the webpage source file has finished loading. Because the embedded browser executes a jQuery information extraction expression to extract the webpage information, instead of a regular expression, the structure of the expression is simpler, which reduces the complexity of the rules that define webpage information extraction and lowers development cost.
The webpage information extraction equipment in the embodiments of the present invention may include, but is not limited to, mobile phones, portable computers, tablet computers, personal digital assistants (PDA), smart watches, and other user equipment capable of accessing webpages. The webpage address is preferably the Uniform Resource Locator (URL) of the webpage, and the webpage source file is preferably an HTML file.
The webpage information extraction method provided by the embodiments of the present invention is described in detail below with reference to Fig. 1 and Fig. 2.
Refer to Fig. 1, a schematic flowchart of a webpage information extraction method provided by an embodiment of the present invention. As shown in Fig. 1, the method of this embodiment may comprise the following steps S101 to S103.
S101: obtain the webpage source file corresponding to the input webpage address, and obtain, from the webpage source file, the feature description information corresponding to the webpage information to be extracted;
Specifically, the webpage information extraction equipment may obtain the webpage source file corresponding to the input webpage address. Preferably, the equipment uses a system browser to take the input webpage address, load the corresponding webpage source file, and obtain that source file; the system browser may be the default web browser of the equipment. The equipment may then obtain, from the webpage source file, the feature description information corresponding to the webpage information to be extracted. The webpage information is the specific content that needs to be extracted from the webpage, and the feature description information is the descriptive language that expresses the webpage information in the webpage source file; specifically, it may be information that includes the attributes and tags corresponding to the webpage information to be extracted.
S102: generate, according to the feature description information, the information extraction expression associated with the webpage information;
Specifically, the webpage information extraction equipment may generate the information extraction expression associated with the webpage information according to the attributes, tags, and so on in the feature description information. The information extraction expression is preferably a jQuery expression. It should be understood that the information extraction expression is an expression used to extract webpage information from the webpage source file, and that it can be recognized and called by the embedded browser.
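As an illustration of step S102, the following Python sketch turns a feature description (a tag plus an optional class and attribute) into a jQuery expression string. The function name `build_jquery_expression`, its parameters, and the exact shape of the generated expression are assumptions made for this example; the text above only states that the expression is generated from the attributes and tags in the feature description information.

```python
def build_jquery_expression(tag, css_class=None, attribute=None):
    """Build a jQuery expression string from a feature description (hypothetical helper)."""
    selector = tag.lower()
    if css_class:
        selector += "." + css_class  # restrict to elements carrying this class
    if attribute:
        getter = '$(this).attr("%s")' % attribute  # read the named attribute
    else:
        getter = "$(this).text()"  # fall back to the element's text
    return '$("%s").map(function(){return %s;}).get()' % (selector, getter)


# Feature description: the title attribute of A tags whose class is nbg.
expression = build_jquery_expression("A", css_class="nbg", attribute="title")
print(expression)
```

The generated string is only text at this point; it does nothing until a browser with jQuery available evaluates it against the loaded page.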
S103: load the webpage source file corresponding to the webpage address with an embedded browser, and call the information extraction expression to extract the webpage information after the webpage source file has finished loading;
Specifically, the webpage information extraction equipment may load the webpage source file corresponding to the webpage address with an embedded browser. Preferably, the embedded browser may be a browser loaded by calling the Browser class of the Standard Widget Toolkit (SWT). The equipment controls the embedded browser to call the information extraction expression and extract the webpage information after the webpage source file has finished loading.
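The ordering in step S103 (load first, evaluate the expression only once loading has completed) can be sketched in Python as follows. `FakeEmbeddedBrowser` is a hypothetical stand-in that records calls and returns a canned result; it does not render pages and is not the SWT Browser API. The `load`/`evaluate` method names, the `extract` helper, and the example URL are likewise assumptions for illustration.

```python
class FakeEmbeddedBrowser:
    """Hypothetical stand-in for an embedded browser; records calls, renders nothing."""

    def __init__(self, canned_result):
        self.canned_result = canned_result
        self.loaded = False
        self.log = []

    def load(self, url, on_complete):
        self.log.append(("load", url))
        self.loaded = True  # a real browser would finish loading asynchronously
        on_complete()       # fire the load-complete callback

    def evaluate(self, script):
        if not self.loaded:
            raise RuntimeError("page has not finished loading")
        self.log.append(("evaluate", script))
        return self.canned_result


def extract(browser, url, expression):
    """Load the page, then evaluate the extraction expression in the completion callback."""
    results = []

    def on_complete():
        results.append(browser.evaluate("return " + expression + ";"))

    browser.load(url, on_complete)
    return results[0] if results else None


browser = FakeEmbeddedBrowser(["Example Movie"])
print(extract(browser, "http://example.com/list", '$("a.nbg").attr("title")'))
```

Deferring the evaluation to the completion callback is what lets the expression see content that scripts on the page add while it loads.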
In the embodiments of the present invention, the webpage source file corresponding to the input webpage address is obtained, the feature description information corresponding to the webpage information to be extracted is obtained from the webpage source file, a jQuery information extraction expression associated with the webpage information is generated from the feature description information, and finally an embedded browser loads the webpage source file corresponding to the webpage address and calls the information extraction expression to extract the webpage information once the webpage source file has finished loading. Because the embedded browser executes a jQuery information extraction expression to extract the webpage information, instead of a regular expression, the structure of the expression is simpler, which reduces the complexity of the rules that define webpage information extraction and lowers development cost.
Refer to Fig. 2, a schematic flowchart of another webpage information extraction method provided by an embodiment of the present invention. As shown in Fig. 2, the method of this embodiment may comprise the following steps S201 to S206.
S201: use a system browser to take the input webpage address, load the webpage source file corresponding to the webpage address, and obtain the webpage source file;
Specifically, the webpage information extraction equipment may use a system browser to take the input webpage address, load the corresponding webpage source file, and obtain that source file. The system browser may be the default web browser of the equipment.
S202: obtain, from the webpage source file, the feature description information that includes the attributes and tags corresponding to the webpage information to be extracted;
Specifically, the webpage information extraction equipment may obtain, from the webpage source file, the feature description information corresponding to the webpage information to be extracted. The webpage information is the specific content that needs to be extracted from the webpage, such as a movie name, a picture, or a text fragment. The feature description information is the descriptive language that expresses the webpage information in the webpage source file; specifically, it may be information that includes the attributes and tags corresponding to the webpage information to be extracted. For example, the webpage information may be contained in the title attribute of an A tag whose class attribute is nbg.
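To make the example in step S202 concrete, the sketch below uses Python's standard `html.parser` to find which tag and attribute carry a known sample value in an HTML fragment, yielding a small feature description. The class `FeatureFinder`, the sample markup, and the value "Example Movie" are invented for illustration; the method described here does not prescribe how the feature description is located.

```python
from html.parser import HTMLParser


class FeatureFinder(HTMLParser):
    """Record the tag, class and attribute name of elements carrying a target value."""

    def __init__(self, target):
        super().__init__()
        self.target = target
        self.features = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        for name, value in attrs:
            if value == self.target:
                self.features.append(
                    {"tag": tag, "class": attr_map.get("class"), "attribute": name}
                )


# Hypothetical fragment of a webpage source file.
sample = '<div><a class="nbg" href="/m/1" title="Example Movie"><img src="p.jpg"></a></div>'
finder = FeatureFinder("Example Movie")
finder.feed(sample)
print(finder.features)
```

For this fragment the finder reports one feature: tag `a`, class `nbg`, attribute `title`, which matches the example given in the step.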
Because most websites today are developed with Asynchronous JavaScript and XML (AJAX) technology, the prior-art approach of forging an HTTP request to access the webpage address cannot obtain the webpage information that these websites generate dynamically. The embodiments of the present invention obtain the webpage source file through the system browser, which makes it possible to capture the dynamically generated webpage information in the webpage source file, improves the ability to parse the webpage source file, and ensures that the extracted webpage information is complete and accurate.
S203: verify the feature description information with the program debugging function of the system browser;
Specifically, the webpage information extraction equipment may verify the feature description information, preferably with the program debugging function of the system browser. If the feature description information is expressed correctly, verification is determined to have passed. After the feature description information passes verification, the equipment may proceed to step S204. Verifying the feature description information further ensures the accuracy of webpage information extraction.
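Step S203 performs verification in the system browser's debugging tools. The Python sketch below is not that mechanism; it is a hypothetical offline pre-check of the same idea, rejecting a feature description whose tag, class, or attribute name is not a plain identifier before any expression is generated. The function `verify_feature`, the dictionary keys, and the naming rule are all assumptions.

```python
import re

# Assumed rule: names start with a letter, then letters, digits, "_" or "-".
_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def verify_feature(feature):
    """Return the list of invalid fields; an empty list means verification passed."""
    problems = []
    tag = feature.get("tag")
    if not tag or not _NAME.fullmatch(tag):
        problems.append("tag")
    for key in ("class", "attribute"):
        value = feature.get(key)
        if value is not None and not _NAME.fullmatch(value):
            problems.append(key)
    return problems


print(verify_feature({"tag": "a", "class": "nbg", "attribute": "title"}))
print(verify_feature({"tag": "a b", "class": None, "attribute": "ti tle"}))
```

A caller would continue to expression generation only when the returned list is empty.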
S204: after verification passes, generate, according to the feature description information, the information extraction expression associated with the webpage information;
Specifically, after the feature description information passes verification, the webpage information extraction equipment may proceed to generate the information extraction expression associated with the webpage information according to the feature description information. The equipment may generate the expression according to the attributes, tags, and so on in the feature description information. The information extraction expression is preferably a jQuery expression. It should be understood that the information extraction expression is an expression used to extract webpage information from the webpage source file, and that it can be recognized and called by the embedded browser.
S205: load the webpage source file corresponding to the webpage address with an embedded browser, and call the information extraction expression to extract the webpage information after the webpage source file has finished loading;
Specifically, the webpage information extraction equipment may load the webpage source file corresponding to the webpage address with an embedded browser. Preferably, the embedded browser may be a browser loaded by calling the Browser class of SWT. The equipment controls the embedded browser to call the information extraction expression and extract the webpage information after the webpage source file has finished loading.
S206: store the webpage address and the information extraction expression;
Specifically, the webpage information extraction equipment may store the webpage address and the information extraction expression. Storing them allows them to be reused when the required webpage information changes, without regenerating the information extraction expression, which further reduces development and maintenance cost.
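Step S206 can be sketched as a small store that maps each webpage address to its information extraction expression. The JSON file format, the function names `load_rules` and `save_rule`, and the example URL are assumptions; the step itself does not specify a storage format.

```python
import json
import os
import tempfile


def load_rules(path):
    """Return the stored {webpage address: expression} mapping, or {} if none exists."""
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def save_rule(path, url, expression):
    """Store or overwrite the expression kept for one webpage address."""
    rules = load_rules(path)
    rules[url] = expression
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(rules, handle, indent=2)


# Keep the rule in a temporary directory for this demonstration.
store = os.path.join(tempfile.mkdtemp(), "rules.json")
save_rule(store, "http://example.com/movies", '$("a.nbg").attr("title")')
print(load_rules(store))
```

A later run can look up the expression by webpage address and pass it straight to the embedded browser instead of generating it again.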
In the embodiments of the present invention, the webpage source file corresponding to the input webpage address is obtained, the feature description information corresponding to the webpage information to be extracted is obtained from the webpage source file, a jQuery information extraction expression associated with the webpage information is generated from the feature description information, and finally an embedded browser loads the webpage source file corresponding to the webpage address and calls the information extraction expression to extract the webpage information once the webpage source file has finished loading. Because the embedded browser executes a jQuery information extraction expression to extract the webpage information, instead of a regular expression, the structure of the expression is simpler, which reduces the complexity of the rules that define webpage information extraction and lowers development cost. Obtaining the webpage source file through the system browser makes it possible to capture dynamically generated webpage information in the webpage source file, which improves the ability to parse the webpage source file and ensures that the extracted webpage information is complete and accurate. Verifying the feature description information further ensures the accuracy of webpage information extraction. Storing the webpage address and the information extraction expression allows them to be reused when the required webpage information changes, without regenerating the information extraction expression, which further reduces development and maintenance cost.
The webpage information extraction equipment provided by the embodiments of the present invention is described in detail below with reference to Figs. 3 to 5. Note that the equipment shown in Figs. 3 to 5 is used to perform the methods of the embodiments shown in Fig. 1 and Fig. 2. For ease of description, only the parts relevant to the embodiments of the present invention are shown; for specific technical details that are not disclosed, refer to the embodiments shown in Fig. 1 and Fig. 2.
Refer to Fig. 3, a schematic structural diagram of webpage information extraction equipment provided by an embodiment of the present invention. As shown in Fig. 3, the webpage information extraction equipment 1 of this embodiment may comprise an information acquisition unit 11, an expression generation unit 12, and an information extraction unit 13.
The information acquisition unit 11 is configured to obtain the webpage source file corresponding to the input webpage address, and to obtain, from the webpage source file, the feature description information corresponding to the webpage information to be extracted;
In a specific implementation, the information acquisition unit 11 may obtain the webpage source file corresponding to the input webpage address. Preferably, the information acquisition unit 11 uses a system browser to take the input webpage address, load the corresponding webpage source file, and obtain that source file; the system browser may be the default web browser of the webpage information extraction equipment 1. The information acquisition unit 11 may then obtain, from the webpage source file, the feature description information corresponding to the webpage information to be extracted. The webpage information is the specific content that needs to be extracted from the webpage, and the feature description information is the descriptive language that expresses the webpage information in the webpage source file; specifically, it may be information that includes the attributes and tags corresponding to the webpage information to be extracted.
Specifically, refer also to Fig. 4, a schematic structural diagram of the information acquisition unit provided by an embodiment of the present invention. As shown in Fig. 4, the information acquisition unit 11 may comprise:
a file acquisition subunit 111, configured to use a system browser to take the input webpage address, load the webpage source file corresponding to the webpage address, and obtain the webpage source file;
In a specific implementation, the file acquisition subunit 111 may use a system browser to take the input webpage address, load the corresponding webpage source file, and obtain that source file. The system browser may be the default web browser of the webpage information extraction equipment 1.
an information acquisition subunit 112, configured to obtain, from the webpage source file, the feature description information that includes the attributes and tags corresponding to the webpage information to be extracted;
In a specific implementation, the information acquisition subunit 112 may obtain, from the webpage source file, the feature description information corresponding to the webpage information to be extracted. The webpage information is the specific content that needs to be extracted from the webpage, such as a movie name, a picture, or a text fragment. The feature description information is the descriptive language that expresses the webpage information in the webpage source file; specifically, it may be information that includes the attributes and tags corresponding to the webpage information to be extracted. For example, the webpage information may be contained in the title attribute of an A tag whose class attribute is nbg.
Because most websites today are developed with AJAX technology, the prior-art approach of forging an HTTP request to access the webpage address cannot obtain the webpage information that these websites generate dynamically. The embodiments of the present invention obtain the webpage source file through the system browser, which makes it possible to capture the dynamically generated webpage information in the webpage source file, improves the ability to parse the webpage source file, and ensures that the extracted webpage information is complete and accurate.
The expression generation unit 12 is configured to generate, according to the feature description information, the information extraction expression associated with the webpage information;
In a specific implementation, the expression generation unit 12 may generate the information extraction expression associated with the webpage information according to the attributes, tags, and so on in the feature description information. The information extraction expression is preferably a jQuery expression. It should be understood that the information extraction expression is an expression used to extract webpage information from the webpage source file, and that it can be recognized and called by the embedded browser.
The information extraction unit 13 is configured to load the webpage source file corresponding to the webpage address with an embedded browser, and to call the information extraction expression to extract the webpage information after the webpage source file has finished loading;
In a specific implementation, the information extraction unit 13 may load the webpage source file corresponding to the webpage address with an embedded browser. Preferably, the embedded browser may be a browser loaded by calling the Browser class of SWT. The information extraction unit 13 controls the embedded browser to call the information extraction expression and extract the webpage information after the webpage source file has finished loading.
In the embodiments of the present invention, the webpage source file corresponding to the input webpage address is obtained, the feature description information corresponding to the webpage information to be extracted is obtained from the webpage source file, a jQuery information extraction expression associated with the webpage information is generated from the feature description information, and finally an embedded browser loads the webpage source file corresponding to the webpage address and calls the information extraction expression to extract the webpage information once the webpage source file has finished loading. Because the embedded browser executes a jQuery information extraction expression to extract the webpage information, instead of a regular expression, the structure of the expression is simpler, which reduces the complexity of the rules that define webpage information extraction and lowers development cost. Obtaining the webpage source file through the system browser makes it possible to capture dynamically generated webpage information in the webpage source file, which improves the ability to parse the webpage source file and ensures that the extracted webpage information is complete and accurate.
Refer to Fig. 5, for embodiments providing the structural representation of another kind of info web extraction equipment.As shown in Figure 5, the described info web extraction equipment 1 of the embodiment of the present invention can comprise: information acquisition unit 11, expression formula generation unit 12, information extraction unit 13, notification unit 14 and storage unit 15; Wherein, the structure of information acquisition unit 11, expression formula generation unit 12 and information extraction unit 13 can the specific descriptions of embodiment shown in Figure 3, do not repeat at this.
Notification unit 14, for adopting the program debugging function of system browser to verify described feature interpretation information, and after being verified, notifying that described expression formula generation unit 12 performs generating according to described feature interpretation information the information extraction expression formula be associated with described info web;
In specific implementation, described notification unit 14 can be verified described feature interpretation information, preferably, the program debugging function of system browser can be adopted to verify described feature interpretation information, if the statement of described feature interpretation information is correct, then can determine to be verified.Described notification unit 14 can notified that described expression formula generation unit 12 performs and generated according to described feature interpretation information the information extraction expression formula be associated with described info web by rear described feature interpretation Information Authentication.By verifying feature interpretation information, the accuracy that info web extracts can be ensured further.
Storage unit 15, for storing described web page address and described information extraction expression formula;
In specific implementation, described storage unit 15 can store described web page address and described information extraction expression formula, by storing web page address and information extraction expression formula, can reuse when required info web is changed, extracting expression formula without the need to repeating information generated, reduce further cost of development and maintenance cost.
In the embodiments of the present invention, the webpage source file corresponding to the input webpage address is obtained, the feature description information corresponding to the webpage information to be extracted is obtained from the webpage source file, and a jQuery information extraction expression associated with the webpage information is generated according to the feature description information. Finally, a built-in browser loads the webpage source file corresponding to the webpage address, and the information extraction expression is called to extract the webpage information after the webpage source file has finished loading. Extracting webpage information by having the built-in browser execute a jQuery information extraction expression replaces extraction by regular expressions; this simplifies the structure of the expression, reduces the complexity of the rules that define the extraction, and lowers development cost. Obtaining the webpage source file through the system browser makes it possible to get webpage information that is generated dynamically in the webpage source file, which improves the ability to parse the webpage source file while ensuring the completeness and accuracy of the extraction. Verifying the feature description information further ensures the accuracy of the extraction. Storing the webpage address and the information extraction expression allows them to be reused when the required webpage information changes, without generating the information extraction expression again, which further reduces development and maintenance costs.
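As an illustration of the expression generation step, the sketch below builds a jQuery expression string from a tag name and attribute values. It is a minimal sketch under stated assumptions: the function name `build_jquery_expression`, the dictionary shape of the feature description, and the choice of accessor methods are hypothetical, and the original does not describe how the expression is assembled. Only the jQuery attribute-equals selector and the `.text()`, `.html()` and `.val()` methods are taken from jQuery itself.

```python
import json

# jQuery methods this sketch allows for reading the matched element.
_ALLOWED_ACCESSORS = ("text", "html", "val")


def build_jquery_expression(tag, attrs=None, accessor="text"):
    """Build a jQuery expression string from a tag name and attribute values."""
    if accessor not in _ALLOWED_ACCESSORS:
        raise ValueError(f"unsupported accessor: {accessor}")
    selector = tag
    for name, value in (attrs or {}).items():
        # Escape backslashes and single quotes for the quoted selector value.
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        selector += f"[{name}='{escaped}']"
    # json.dumps yields a double-quoted string literal that JavaScript accepts.
    return f"$({json.dumps(selector)}).{accessor}()"
```

For a `div` tag whose `class` attribute equals `price`, this returns `$("div[class='price']").text()`, which is noticeably shorter than a regular expression written against the raw page source.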
A person of ordinary skill in the art will appreciate that all or part of the flow of the methods in the above embodiments can be carried out by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the flow of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
What is disclosed above is merely a preferred embodiment of the present invention and certainly cannot be used to limit the scope of rights of the present invention; equivalent variations made according to the claims of the present invention therefore still fall within the scope covered by the present invention.

Claims (10)

1. A webpage information extraction method, characterized by comprising:
Obtaining the webpage source file corresponding to an input webpage address, and obtaining, in the webpage source file, feature description information corresponding to the webpage information to be extracted;
Generating, according to the feature description information, an information extraction expression associated with the webpage information, the information extraction expression being a jQuery expression;
Loading the webpage source file corresponding to the webpage address using a built-in browser, and calling the information extraction expression to extract the webpage information after the webpage source file has finished loading.
2. The method according to claim 1, characterized in that obtaining the webpage source file corresponding to the input webpage address, and obtaining, in the webpage source file, the feature description information corresponding to the webpage information to be extracted, comprises:
Obtaining the input webpage address using a system browser so as to load the webpage source file corresponding to the webpage address, and obtaining the webpage source file;
Obtaining, in the webpage source file, feature description information that includes the attributes and tags corresponding to the webpage information to be extracted.
3. method according to claim 1 and 2, is characterized in that, described in described webpage source file, obtain info web characteristic of correspondence descriptor to be extracted after, also comprise:
Adopt the program debugging function of system browser to verify described feature interpretation information, and after being verified, performing the step generating the information extraction expression formula be associated with described info web according to described feature interpretation information.
4. The method according to claim 1, characterized in that the built-in browser is a browser loaded by calling the Browser class of the Standard Widget Toolkit (SWT).
5. The method according to claim 1, characterized by further comprising:
Storing the webpage address and the information extraction expression.
6. A webpage information extraction equipment, characterized by comprising:
An information acquisition unit, configured to obtain the webpage source file corresponding to an input webpage address, and to obtain, in the webpage source file, feature description information corresponding to the webpage information to be extracted;
An expression generation unit, configured to generate, according to the feature description information, an information extraction expression associated with the webpage information, the information extraction expression being a jQuery expression;
An information extraction unit, configured to load the webpage source file corresponding to the webpage address using a built-in browser, and to call the information extraction expression to extract the webpage information after the webpage source file has finished loading.
7. The equipment according to claim 6, characterized in that the information acquisition unit comprises:
A file acquisition subunit, configured to obtain the input webpage address using a system browser so as to load the webpage source file corresponding to the webpage address, and to obtain the webpage source file;
An information acquisition subunit, configured to obtain, in the webpage source file, feature description information that includes the attributes and tags corresponding to the webpage information to be extracted.
8. The equipment according to claim 6 or 7, characterized by further comprising:
A notification unit, configured to verify the feature description information using the program debugging function of the system browser and, after the verification passes, to notify the expression generation unit to generate, according to the feature description information, the information extraction expression associated with the webpage information.
9. The equipment according to claim 6, characterized in that the built-in browser is a browser loaded by calling the Browser class of the Standard Widget Toolkit (SWT).
10. The equipment according to claim 6, characterized by further comprising:
A storage unit, configured to store the webpage address and the information extraction expression.
CN201510815150.2A 2015-11-20 2015-11-20 Webpage information extraction method and equipment Pending CN105468730A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510815150.2A CN105468730A (en) 2015-11-20 2015-11-20 Webpage information extraction method and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510815150.2A CN105468730A (en) 2015-11-20 2015-11-20 Webpage information extraction method and equipment

Publications (1)

Publication Number Publication Date
CN105468730A true CN105468730A (en) 2016-04-06

Family

ID=55606431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510815150.2A Pending CN105468730A (en) 2015-11-20 2015-11-20 Webpage information extraction method and equipment

Country Status (1)

Country Link
CN (1) CN105468730A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106886547A (en) * 2016-07-13 2017-06-23 阿里巴巴集团控股有限公司 A kind of scenario generation method and device
WO2018010573A1 (en) * 2016-07-13 2018-01-18 阿里巴巴集团控股有限公司 Method and device for generating script
CN106570133A (en) * 2016-10-27 2017-04-19 任子行网络技术股份有限公司 Method and device for constructing visual webpage information extracting rule
CN106570133B (en) * 2016-10-27 2019-07-23 任子行网络技术股份有限公司 A kind of construction method and device of visual webpage information extracting rule

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160406