CN104317948A - Page data capturing method and system - Google Patents

Page data capturing method and system

Info

Publication number
CN104317948A
Authority
CN
China
Prior art keywords
target pages
data
configuration information
page data
text data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410635960.5A
Other languages
Chinese (zh)
Inventor
刘旭辉
任继成
高照
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING ZHONGKE FULONG INFORMATION TECHNOLOGY Co Ltd
Original Assignee
BEIJING ZHONGKE FULONG INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING ZHONGKE FULONG INFORMATION TECHNOLOGY Co Ltd
Priority to CN201410635960.5A
Publication of CN104317948A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to a page data capturing method and a page data capturing system. The method comprises the steps of: S1, analyzing a target page to obtain configuration information of the target page, and generating a matching template according to the configuration information; S2, obtaining address information of the target page from the configuration information, determining the target page according to the address information, and obtaining text data in the target page; S3, capturing the text data in the target page according to the matching template by a capturing unit, and storing the text data as a basis for index operation. According to the page data capturing method and system, the capturing unit can quickly adapt to the pages of various websites, and specific areas and/or data in the target page can be accurately captured.

Description

Page data capturing method and system
Technical field
The present invention relates to the technical field of data processing, and in particular to a page data capturing method and a page data capturing system.
Background
Obtaining the most useful information in the simplest way has always been people's goal. Simplicity, reliability, and stable performance are therefore the primary requirements when programmers design information collection and retrieval systems. With the rapid development of global informatization, the Internet has produced an enormous number of web pages, and traditional search engines (Google, Baidu, etc.) emerged as a direct solution to the information retrieval problem that users face with this mass of pages. However, traditional search engines emphasize breadth of retrieval and struggle to meet users' increasingly personalized and specialized search needs. Vertical search and enterprise-level search for particular purposes, fields, or topics arose as a result. A personalized search engine searches only specific network resources or those the user finds most interesting, so it can offer users a more convenient and efficient retrieval service, and it is gradually becoming an important trend in modern information retrieval.
A search engine generally has three main components. The crawler is mainly responsible for capturing web page data and supplies the search engine with the source data for retrieval. The index is an inverted file, keyed by word, that is built after segmenting the web page data collected by the crawler in order to improve retrieval efficiency. Ranking matches the query entered by the user against the index and returns the results to the user ordered by certain ranking rules. The web pages collected by the crawler are thus the data source of the search engine, and their quality strongly affects retrieval results. The biggest difference between a personalized search engine and a general one is the crawler system. A traditional crawler meets the general search needs of a large number of users by covering as many Internet resources as possible; it captures resources with a breadth-first traversal similar to that of a directed graph, and it emphasizes breadth of collection. A personalized search engine instead aims to capture the most valuable web information with the fewest crawler resources and to filter out as much useless information as possible, so as to give users highly accurate information. Its crawler module is here called a "vertical crawler". "Vertical" is relative to the horizontal search of a comprehensive search engine's crawler module, which yields a large volume of information, imprecise queries, and insufficient depth. The main difference between a vertical crawler and a general crawler is that the vertical crawler performs structured information extraction, converting the unstructured data of a web page into specific structured data. It is characterized by being specialized, precise, and deep, and it is industry-specific. Compared with the disordered mass of information in a comprehensive search engine, a vertical crawler is more focused, specific, and in-depth.
As described above, for a personalized search engine to make its search results specialized, precise, and deep, its crawler module must accurately capture specific columns of every type of website or forum, or even the article list for a given topic, and must avoid or minimize unrelated content in the search results, which would degrade the user experience. Developing a dedicated crawler for each type of website or editing system might seem a workable solution. However, as Internet sites keep growing and page content keeps changing, this approach imposes a heavy workload on developers and maintainers, and retrieval efficiency cannot be guaranteed.
Summary of the invention
The technical problem to be solved by the present invention is how to enable a capturing unit to adapt quickly to the pages of various websites and to accurately capture specific areas and/or data in a target page.
To this end, the present invention proposes a page data capturing method, comprising: S1, parsing a target page to obtain configuration information of the target page, and generating a matching template according to the configuration information; S2, obtaining address information of the target page from the configuration information, determining the target page according to the address information, and obtaining text data in the target page; S3, capturing, by a capturing unit, the text data in the target page according to the matching template, and storing the text data as a basis for index operations.
Preferably, step S1 further comprises: loading the configuration information into the static memory of the capturing unit.
Preferably, step S2 further comprises: inserting the address information into an address queue from the tail of the queue, wherein the address queue is managed by a singleton pattern.
Preferably, step S2 further comprises: filtering out data in the text data that does not conform to a preset data type.
Preferably, before step S1, the method further comprises: judging the complexity of the target page; when the complexity is less than or equal to a preset value, parsing the target page with regular expressions; and when the complexity is greater than the preset value, parsing the target page with the jsoup framework.
The present invention also proposes a page data capturing system, comprising: a parsing unit, configured to parse a target page to obtain configuration information of the target page and to generate a matching template according to the configuration information; an obtaining unit, configured to obtain address information of the target page from the configuration information, determine the target page according to the address information, and obtain text data in the target page; and a capturing unit, configured to capture the text data in the target page according to the matching template and to store the text data as a basis for index operations.
Preferably, the system further comprises: a loading unit, configured to load the configuration information into the static memory of the capturing unit.
Preferably, the system further comprises: a queue management unit, configured to insert the address information into an address queue from the tail of the queue, wherein the address queue is managed by a singleton pattern.
Preferably, the system further comprises: a filtering unit, configured to filter out data in the text data that does not conform to a preset data type.
Preferably, the system further comprises: a judging unit, configured to judge the complexity of the target page, wherein the parsing unit parses the target page with regular expressions when the complexity is less than or equal to a preset value, and parses the target page with the jsoup framework when the complexity is greater than the preset value.
With the above technical solution, the present invention parses the target page and tailors a matching template to it, so that the capturing unit can accurately capture areas and/or data in the target page according to the configuration information of the template. If a website or web page changes, only the template needs to be modified, not the capturing unit, and modifying a template takes less work than modifying a capturing unit. This improves work efficiency and shortens the cycle in which the capturing unit captures data, so the content that the capturing unit provides for indexing is updated promptly.
Brief description of the drawings
The features and advantages of the present invention can be understood more clearly with reference to the accompanying drawings, which are schematic and should not be construed as limiting the present invention in any way. In the drawings:
Fig. 1 shows a schematic flow diagram of a page data capturing method according to an embodiment of the present invention;
Fig. 2 shows a schematic block diagram of a page data capturing method according to an embodiment of the present invention;
Fig. 3 shows a schematic diagram of source code splitting according to an embodiment of the present invention;
Fig. 4 shows a schematic flow diagram of a page data capturing method according to another embodiment of the present invention;
Fig. 5 shows a detailed flow chart of the capture operation according to another embodiment of the present invention.
Detailed description
To make the above objects, features, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the drawings and specific embodiments. It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another.
Many specific details are set forth in the following description to provide a thorough understanding of the present invention. However, the present invention may also be implemented in ways other than those described here, and the scope of protection of the present invention is therefore not limited by the specific embodiments disclosed below.
As shown in Fig. 1, a page data capturing method according to an embodiment of the present invention comprises: S1, parsing a target page to obtain configuration information of the target page, and generating a matching template according to the configuration information; S2, obtaining address information of the target page from the configuration information, determining the target page according to the address information, and obtaining text data in the target page; S3, capturing, by a capturing unit, the text data in the target page according to the matching template, and storing the text data as a basis for index operations.
By parsing a target page (for example, a web page), the configuration information of the target page can be obtained and stored in an XML file in the form of tags so that the capturing unit can analyze it. In one embodiment of the present invention, the configuration information stored in the XML file is as shown in Table 1:
Table 1
The obtained configuration information includes at least the address information of the page (for example, a URL link), text information, and structural information (for example, column links and paging link lists), and the matching template can be generated from the structural information. The address information of the target page is then obtained from the configuration information, the target page is determined according to the address information, and the text data in the target page is obtained. Finally, the capturing unit captures the target data from the text data according to the matching template. Preferably, a queue can be set up for the capture operations of the capturing unit to provide first-in, first-out flow control for parsing, storing, capturing, and similar operations, which prevents data from being omitted or stored repeatedly and ensures data uniqueness.
The text data captured by the capturing unit (for example, title, publication time, and body content) can be stored in a database (the data can be organized as rows of a database table or in local files). Because the capturing unit captures data according to the matching template, it can capture data from any page that fits the template without its program being modified for different pages (or even different columns). For example, the economy channel of a website may have multiple sub-channels whose structural information is identical (or similar). A matching template can then be generated from the structural information of one sub-channel, and the capturing unit can capture the text data in every sub-channel according to that template without being updated for each one. A single capturing unit thus serves an entire website or even several websites, which saves the space the system would otherwise reserve for storing multiple capturing units.
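The reuse of one capturing unit across sub-channels can be illustrated with a minimal Java sketch. It is not taken from the original disclosure: the class `TemplateCapture`, the field names, the regular expressions, and the sample HTML are all hypothetical, and a template is assumed here to be a simple map from field name to a regular expression with one capturing group.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: one capturing unit driven by a reusable matching template.
final class TemplateCapture {

    // Assumed template shape: field name -> regex with exactly one capturing group.
    static Map<String, Pattern> newsTemplate() {
        Map<String, Pattern> t = new LinkedHashMap<>();
        t.put("title", Pattern.compile("<h1[^>]*>(.*?)</h1>", Pattern.DOTALL));
        t.put("time", Pattern.compile("<span class=\"time\">(.*?)</span>", Pattern.DOTALL));
        t.put("body", Pattern.compile("<div class=\"content\">(.*?)</div>", Pattern.DOTALL));
        return t;
    }

    // The capturing code stays the same for every page; only the template varies.
    static Map<String, String> capture(String html, Map<String, Pattern> template) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> e : template.entrySet()) {
            Matcher m = e.getValue().matcher(html);
            if (m.find()) {
                out.put(e.getKey(), m.group(1).trim());
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Two hypothetical sub-channel pages that share the same structure.
        String stocks = "<h1>Stocks rise</h1><span class=\"time\">2014-11-05</span>"
                + "<div class=\"content\">Markets closed higher.</div>";
        String funds = "<h1>Funds update</h1><span class=\"time\">2014-11-06</span>"
                + "<div class=\"content\">Inflows continued.</div>";
        Map<String, Pattern> template = newsTemplate();
        System.out.println(capture(stocks, template));
        System.out.println(capture(funds, template));
    }
}
```

In this sketch a change to the page layout would only require editing the patterns returned by `newsTemplate()`, while `capture` is left untouched, which mirrors the maintenance argument made above.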
Capturing with a matching template also significantly shortens the program development cycle and improves the efficiency of developers. Data captured according to a matching template better meets requirements and gives indexing a more accurate basis. When the target page changes, only the template needs to be changed, not the capturing unit, so maintenance costs are reduced.
Preferably, step S1 further comprises: loading the configuration information into the static memory of the capturing unit. The configuration information in Table 1 can be loaded into the static memory of the capturing unit, for example stored in the form Map<code, PageStruct>, rather than kept in a database. With this storage approach, when the capturing unit frequently matches and captures page data, it only needs to take the page configuration information from the container's static memory and does not have to read it from the database again. This improves capture efficiency and reduces the I/O pressure on the database.
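A static in-memory store of this kind might look like the following Java sketch. The source only names the form Map<code, PageStruct>; the fields of `PageStruct`, the class name `ConfigCache`, the use of a string site code as key, and the sample values are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of configuration held in static memory instead of a database.
final class ConfigCache {

    // Assumed shape of one page's configuration; the source only names the type.
    static final class PageStruct {
        final String url;
        final String listArea;

        PageStruct(String url, String listArea) {
            this.url = url;
            this.listArea = listArea;
        }
    }

    // Static map shared by all capture tasks: site code -> page configuration.
    private static final Map<String, PageStruct> CONFIG = new ConcurrentHashMap<>();

    // Intended to be called once, after the XML configuration has been parsed.
    static void load(Map<String, PageStruct> parsed) {
        CONFIG.putAll(parsed);
    }

    // Lookups touch memory only, so no database read happens per captured page.
    static PageStruct get(String code) {
        return CONFIG.get(code);
    }

    public static void main(String[] args) {
        load(Collections.singletonMap("finance",
                new PageStruct("http://example.com/finance/", "news-list")));
        System.out.println(get("finance").url);
    }
}
```

A `ConcurrentHashMap` is chosen here so that several capture threads can read the configuration safely; a plain `HashMap` populated before any thread starts would serve equally well under that assumption.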
Preferably, step S2 further comprises: inserting the address information into an address queue from the tail of the queue, wherein the address queue is managed by a singleton pattern. According to the configuration information, the news link addresses in the designated area of the page, for example the href addresses corresponding to captureType in the tag descriptions in Table 1, can be stored in a queue managed by a singleton pattern. Links are inserted at the tail of the queue, which ensures that each capture task has only one queue instance and that multiple threads can work synchronously when links are taken from the head of the queue for content parsing. If a paging configuration flag is present or a subsequent page number can be extracted, address information continues to be inserted into the queue. The specific queue processing flow during capture is shown in Fig. 5, and the data flow is shown in Fig. 4.
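The singleton-managed address queue can be sketched in Java as follows. This is an illustrative sketch only: the class name `AddressQueue`, its method names, and the example URLs are hypothetical, and `ConcurrentLinkedQueue` is one possible thread-safe FIFO container rather than the container used by the original implementation.

```java
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch: a single shared address queue managed as a singleton.
final class AddressQueue {

    // Eagerly created single instance; the private constructor blocks other instances.
    private static final AddressQueue INSTANCE = new AddressQueue();

    // Thread-safe FIFO queue, so several parsing threads can share it.
    private final Queue<String> queue = new ConcurrentLinkedQueue<>();

    private AddressQueue() { }

    static AddressQueue getInstance() {
        return INSTANCE;
    }

    // New addresses are inserted at the tail of the queue.
    void enqueue(String url) {
        queue.offer(url);
    }

    // Addresses are taken from the head; returns null when the queue is empty.
    String dequeue() {
        return queue.poll();
    }

    public static void main(String[] args) {
        AddressQueue q = AddressQueue.getInstance();
        q.enqueue("http://example.com/news/list_1.html");
        q.enqueue("http://example.com/news/list_2.html");
        String url;
        while ((url = q.dequeue()) != null) {
            System.out.println("parse " + url);
        }
    }
}
```

When a parsed page reveals a further page number, a worker thread would call `enqueue` again with the follow-up address, so the same loop keeps draining the queue until no addresses remain.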
Preferably, step S2 further comprises: filtering out data in the text data that does not conform to a preset data type. Before data is captured, the data that does not conform to the preset data type can first be filtered out of the text data, which reduces the amount of data the capturing unit has to analyze and improves capture efficiency.
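The source does not specify what a "preset data type" is, so the following Java sketch makes an assumption: each piece of fetched data carries a type label, and only fragments whose label is in a preset set are kept. The class `TypeFilter`, the `Fragment` type, and the labels "text" and "script" are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: drop fragments whose type is not in the preset set.
final class TypeFilter {

    // Assumed representation of one piece of fetched data with a type label.
    static final class Fragment {
        final String type;
        final String text;

        Fragment(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    // Returns only the fragments whose type is one of the preset types, in order.
    static List<Fragment> keepPreset(List<Fragment> fragments, Set<String> presetTypes) {
        List<Fragment> kept = new ArrayList<>();
        for (Fragment f : fragments) {
            if (presetTypes.contains(f.type)) {
                kept.add(f);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Fragment> fetched = Arrays.asList(
                new Fragment("text", "Headline"),
                new Fragment("script", "var x = 1;"),
                new Fragment("text", "Body paragraph"));
        Set<String> preset = new HashSet<>(Arrays.asList("text"));
        for (Fragment f : keepPreset(fetched, preset)) {
            System.out.println(f.text);
        }
    }
}
```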
Preferably, before step S1, the method further comprises: judging the complexity of the target page; when the complexity is less than or equal to a preset value, parsing the target page with regular expressions; and when the complexity is greater than the preset value, parsing the target page with the jsoup framework. Parsing in the present invention can thus take two forms. When page complexity is low, regular expressions are used: the specific content of the web page is parsed from the HTML structure and the relative positions of tags described in the web page configuration file. Because regular expressions are relatively abstract and run less efficiently, they suit web pages with relatively simple structure. For pages of higher complexity, the present invention parses with the jsoup web page parsing framework. jsoup is a Java HTML parser that can directly parse a URL address or HTML text content. It provides a very convenient API (application programming interface) for extracting and manipulating data through DOM (Document Object Model, a standard programming interface for HTML and XML documents) traversal, CSS (Cascading Style Sheets) selectors, and jQuery-like methods (jQuery being a JavaScript library).
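The choice between the two parsing modes can be sketched in Java as below. The source does not define how complexity is measured, so this sketch assumes a simple stand-in metric, the number of opening tags in the page; the class `ParserSelector`, the `Strategy` names, and the threshold value are likewise hypothetical. The actual jsoup call is deliberately left out so that the example depends only on the Java standard library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: pick a parsing strategy from a page-complexity estimate.
final class ParserSelector {

    enum Strategy { REGEX, DOM_FRAMEWORK }

    // Matches opening (and self-closing) tags, but not closing tags or comments.
    private static final Pattern OPEN_TAG = Pattern.compile("<[a-zA-Z][^>]*>");

    // Assumed complexity measure: the number of opening tags in the page.
    static int complexity(String html) {
        Matcher m = OPEN_TAG.matcher(html);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    // At or below the preset value use regular expressions, otherwise a DOM parser.
    static Strategy choose(String html, int presetValue) {
        return complexity(html) <= presetValue ? Strategy.REGEX : Strategy.DOM_FRAMEWORK;
    }

    public static void main(String[] args) {
        String simple = "<p>hello</p>";
        String nested = "<div><ul><li>a</li><li>b</li></ul></div>";
        System.out.println(choose(simple, 3));
        System.out.println(choose(nested, 3));
    }
}
```

In a full implementation the `DOM_FRAMEWORK` branch would hand the page to jsoup; any other measurable property of the page, such as nesting depth, could replace the tag count without changing the dispatch logic.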
As one embodiment of the present invention, page content analysis based on Web document structure uses regular expressions in page parsing. Useless tags can first be deleted and the page then analyzed; end tags are added to tags that have only a start tag to ensure that tags match correctly, and web page content is analyzed using HTML tag distribution rules. The following is a simple example of page source structure after processing:
The source code can be split into the structure shown in Fig. 3. Web page text content is extracted mainly on the basis of the document structure parameters in the description, and regular expression matching can then be used to obtain data such as the URL address of the next page.
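Two of the regular-expression steps described above, deleting useless tags and obtaining the next-page URL, can be sketched in Java as follows. The sketch is illustrative and not the original implementation: the class `RegexPageParser`, the choice of comments, script blocks, and style blocks as the "useless" material, the link text "Next", and the sample HTML are assumptions. The step of adding missing end tags is not covered here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: regex-based clean-up and next-page link extraction.
final class RegexPageParser {

    // Assumed "useless" material: HTML comments plus script and style blocks.
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<(script|style)\\b[^>]*>.*?</\\1\\s*>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Assumes the paging link is an anchor whose visible text is "Next".
    private static final Pattern NEXT_LINK = Pattern.compile(
            "<a\\s+[^>]*href=\"([^\"]+)\"[^>]*>\\s*Next\\s*</a>",
            Pattern.CASE_INSENSITIVE);

    // Removes comments first, then script and style blocks.
    static String clean(String html) {
        String withoutComments = COMMENT.matcher(html).replaceAll("");
        return SCRIPT_OR_STYLE.matcher(withoutComments).replaceAll("");
    }

    // Returns the href of the "Next" link, or null when the page has none.
    static String nextPageUrl(String html) {
        Matcher m = NEXT_LINK.matcher(html);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String page = "<!-- ad --><script>var a = 1;</script><p>News body</p>"
                + "<a href=\"/news/list_1.html\">Prev</a> "
                + "<a class=\"pg\" href=\"/news/list_3.html\">Next</a>";
        String cleaned = clean(page);
        System.out.println(cleaned);
        System.out.println(nextPageUrl(cleaned));
    }
}
```

A URL returned by `nextPageUrl` is the kind of address that would be appended to the tail of the address queue so that paged lists are followed to the end.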
For pages with complex structure, or page target areas that are difficult to describe with regular expressions, the present invention configures the target page through a template, so that the jsoup framework can locate and read the target area and then extract the data in it. This enhances the generality of the capturing unit and shortens the time users need to obtain information. The benefit of a template is that a general capturing unit program suited to a website can be written according to the website's characteristics, and the template reduces the complexity of user operations: users can use the capturing unit without supplying a large amount of information.
The present invention also proposes a page data capturing system 10, comprising: a parsing unit 11, configured to parse a target page to obtain configuration information of the target page and to generate a matching template according to the configuration information; an obtaining unit 12, configured to obtain address information of the target page from the configuration information, determine the target page according to the address information, and obtain text data in the target page; and a capturing unit 13, configured to capture the text data in the target page according to the matching template and to store the text data as a basis for index operations.
Preferably, the system further comprises: a loading unit 14, configured to load the configuration information into the static memory of the capturing unit 13.
Preferably, the system further comprises: a queue management unit 15, configured to insert the address information into an address queue from the tail of the queue, wherein the address queue is managed by a singleton pattern.
Preferably, the system further comprises: a filtering unit 16, configured to filter out data in the text data that does not conform to a preset data type.
Preferably, the system further comprises: a judging unit 17, configured to judge the complexity of the target page, wherein the parsing unit 11 parses the target page with regular expressions when the complexity is less than or equal to a preset value, and parses the target page with the jsoup framework when the complexity is greater than the preset value.
According to an embodiment of the present invention, a non-volatile machine-readable medium is also provided, which stores a program product for page data capture, the program product comprising the above two program products.
According to an embodiment of the present invention, a machine-readable program is also provided, which causes a machine to perform the page data capturing method of any of the above technical solutions.
According to an embodiment of the present invention, a storage medium storing a machine-readable program is also provided, wherein the machine-readable program causes a machine to perform the page data capturing method of any of the above technical solutions.
The technical solution of the present invention has been described above with reference to the accompanying drawings. In the related art, capturing data from different pages requires modifying the capture program, which takes a great deal of programming work, and data captured by the capture program alone cannot give indexing a very accurate basis. With the technical solution of the present application, a matching template can be tailored to the target page, so that the capturing unit can accurately capture areas and/or data in the target page according to the configuration information of the template. If a website or web page changes, only the template needs to be modified, not the capturing unit, and modifying a template takes less work than modifying a capturing unit. This improves work efficiency and shortens the cycle in which the capturing unit captures data, so the content that the capturing unit provides for indexing is updated promptly.
In the present invention, the term "multiple" means two or more, unless expressly defined otherwise.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. For a person skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A page data capturing method, characterized by comprising:
S1, parsing a target page to obtain configuration information of the target page, and generating a matching template according to the configuration information;
S2, obtaining address information of the target page from the configuration information, determining the target page according to the address information, and obtaining text data in the target page;
S3, capturing, by a capturing unit, the text data in the target page according to the matching template, and storing the text data as a basis for index operations.
2. The page data capturing method according to claim 1, characterized in that step S1 further comprises: loading the configuration information into the static memory of the capturing unit.
3. The page data capturing method according to claim 1, characterized in that step S2 further comprises: inserting the address information into an address queue from the tail of the queue, wherein the address queue is managed by a singleton pattern.
4. The page data capturing method according to any one of claims 1 to 3, characterized in that step S2 further comprises: filtering out data in the text data that does not conform to a preset data type.
5. The page data capturing method according to any one of claims 1 to 3, characterized in that, before step S1, the method further comprises: judging the complexity of the target page; when the complexity is less than or equal to a preset value, parsing the target page with regular expressions; and when the complexity is greater than the preset value, parsing the target page with the jsoup framework.
6. A page data capturing system, characterized by comprising:
a parsing unit, configured to parse a target page to obtain configuration information of the target page, and to generate a matching template according to the configuration information;
an obtaining unit, configured to obtain address information of the target page from the configuration information, determine the target page according to the address information, and obtain text data in the target page;
a capturing unit, configured to capture the text data in the target page according to the matching template, and to store the text data as a basis for index operations.
7. The page data capturing system according to claim 6, characterized by further comprising:
a loading unit, configured to load the configuration information into the static memory of the capturing unit.
8. The page data capturing system according to claim 6, characterized by further comprising:
a queue management unit, configured to insert the address information into an address queue from the tail of the queue, wherein the address queue is managed by a singleton pattern.
9. The page data capturing system according to any one of claims 6 to 8, characterized by further comprising:
a filtering unit, configured to filter out data in the text data that does not conform to a preset data type.
10. The page data capturing system according to any one of claims 6 to 8, characterized by further comprising:
a judging unit, configured to judge the complexity of the target page,
wherein the parsing unit parses the target page with regular expressions when the complexity is less than or equal to a preset value, and parses the target page with the jsoup framework when the complexity is greater than the preset value.
CN201410635960.5A 2014-11-05 2014-11-05 Page data capturing method and system Pending CN104317948A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410635960.5A CN104317948A (en) 2014-11-05 2014-11-05 Page data capturing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410635960.5A CN104317948A (en) 2014-11-05 2014-11-05 Page data capturing method and system

Publications (1)

Publication Number Publication Date
CN104317948A true CN104317948A (en) 2015-01-28

Family

ID=52373180

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410635960.5A Pending CN104317948A (en) 2014-11-05 2014-11-05 Page data capturing method and system

Country Status (1)

Country Link
CN (1) CN104317948A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101957816A (en) * 2009-07-13 2011-01-26 上海谐宇网络科技有限公司 Webpage metadata automatic extraction method and system based on multi-page comparison
CN103136358A (en) * 2013-03-07 2013-06-05 宁波成电泰克电子信息技术发展有限公司 Method for automatically extracting BBS (bulletin board system) data
CN103886078A (en) * 2014-03-25 2014-06-25 烟台中科网络技术研究所 Universal news comment collection method and device
CN104050281A (en) * 2014-06-26 2014-09-17 北京思特奇信息技术股份有限公司 Webpage information extraction method and device based on http protocol

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王乐超: "Web环境下文献信息的提取与匹配研究", 《中国优秀硕士学位论文全文数据库》 *
胡少荣等: "网页信息自动抽取技术的研究", 《铁路计算机应用》 *

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105160209A (en) * 2015-08-31 2015-12-16 佛山市恒南微科技有限公司 System for investigating and managing regional enterprise software copyright announcement
CN106649371A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data processing method and device for crawlers
CN106649357A (en) * 2015-10-30 2017-05-10 北京国双科技有限公司 Data processing method and apparatus used for crawler program
CN105447139A (en) * 2015-11-20 2016-03-30 广州华多网络科技有限公司 Data acquisition statistical method, and system, terminal and service equipment thereof
CN105447139B (en) * 2015-11-20 2021-05-11 广州华多网络科技有限公司 Data acquisition statistical method and system, terminal and service equipment thereof
CN105740363A (en) * 2016-01-26 2016-07-06 上海晶赞科技发展有限公司 Website target page discovery method and apparatus
CN106095984A (en) * 2016-06-20 2016-11-09 乐视控股(北京)有限公司 Method and device for obtaining structured data
CN108121751A (en) * 2016-11-30 2018-06-05 北京国双科技有限公司 Method and apparatus for web page crawling
CN106991144B (en) * 2017-03-22 2021-01-29 山东大学 Method and system for customizing data crawling workflow
CN106991144A (en) * 2017-03-22 2017-07-28 山东大学 Method and system for customizing data crawling workflow
CN107025296B (en) * 2017-04-17 2018-11-06 山东辰华科技信息有限公司 Data collection method based on an intelligent capture system for science and technology service information
CN107025296A (en) * 2017-04-17 2017-08-08 山东辰华科技信息有限公司 Data collection method based on an intelligent capture system for science and technology service information
CN108052517A (en) * 2017-10-19 2018-05-18 福建中金在线信息科技有限公司 Data search method and system
CN112749975A (en) * 2019-10-31 2021-05-04 中国移动通信集团浙江有限公司 Method for automatically processing refund request and automatic processing platform
CN112749975B (en) * 2019-10-31 2024-03-22 中国移动通信集团浙江有限公司 Method for automatically processing refund request and automatic processing platform
CN111898012A (en) * 2020-07-23 2020-11-06 昆山领创信息科技有限公司 Automatic packet grabbing method for WEB application
CN113934913A (en) * 2021-11-12 2022-01-14 盐城金堤科技有限公司 Data capture method and device, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN104317948A (en) Page data capturing method and system
Liu et al. ViDE: A vision-based approach for deep web data extraction
CN101957816B (en) Webpage metadata automatic extraction method and system based on multi-page comparison
CN101329687B (en) Method for positioning news web page
CN106126648B (en) Distributed commodity information crawler method based on redo logs
CN104142985A (en) Semi-automatic vertical crawler generation tool and method
US20190235721A1 (en) Flexible content organization and retrieval
CN103838796A (en) Webpage structured information extraction method
CN106021583A (en) Statistical method and system for page flow data
US20150269138A1 (en) Publication Scope Visualization and Analysis
CN104391978A (en) Method and device for storing and processing web pages of browsers
Rath et al. The SEOSS 33 dataset—Requirements, bug reports, code history, and trace links for entire projects
KR20170115109A (en) Text-Mining Application Technique for Productive Construction Document Management
CN103853770B (en) Method and system for extracting model content from forum web pages
CN110020068B (en) Method and device for configuring page crawling rules
CN103365961A (en) Accurate search-oriented website structurization labeling method and system
CN103488675A (en) Automatic precise extraction device for multi-webpage news comment contents
Yatskov et al. Extraction of data from mass media web sites
CN101639840A (en) Method and device for identifying semantic structure of network information
Liu et al. Automatically extracting user reviews from forum sites
Wanjari et al. Automatic news extraction system for Indian online news papers
Tang et al. Regular expression-based reference metadata extraction from the web
CN109948015B (en) Meta search list result extraction method and system
CN104281693A (en) Semantic search method and semantic search system
CN113407803A (en) Method for acquiring internet data in one step

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20150128
