CN110069682A - An internet web page acquisition method

Info

Publication number: CN110069682A
Application number: CN201710822007.5A
Authority: CN
Inventors: 梁威; 谢宏亮
Original assignee: Changsha Biovision Software Technology Co Ltd
Current assignee: Changsha Biovision Software Technology Co Ltd
Priority date: 2017-09-14
Filing date: 2017-09-14
Publication date: 2019-07-30

Abstract

The present invention provides an internet web page acquisition method comprising web page target positioning, process configuration, and starting acquisition, characterized in that it comprises the following steps: a) calling a background script library method to carry out target positioning; b) calling an automated execution script method through configuration; c) storing the acquired data. The invention enables automated and semi-automated data acquisition and data decryption, and can be used for big data analysis such as the computation and analysis of mobile data and of static data.

Description

An internet web page acquisition method

Technical field

The present invention relates to the internet field, and in particular to an internet web page acquisition method.

Background art

Today's society is developing at high speed: science and technology flourish, information flows freely, communication between people grows ever closer, and life becomes more and more convenient. Big data is a product of this network age.

With the emergence of more and more internet applications, people no longer focus solely on the quantity of information obtained from the internet, but care more about its quality. Users' new demands on information led to the birth of the third generation of search engines, with vertical search engine technology at their core. The way a vertical search engine analyzes web pages differs in approach from that of a second-generation search engine. When a second-generation search engine analyzes a web page, it mainly aims at keyword extraction and is not concerned with the meaning of the page content itself. The information collection target of a vertical search engine is domain-related information; therefore, when analyzing a web page, it must not only pay attention to the keywords in the page but also "understand" the meaning of the page content within the specific domain. This requires a vertical search engine to be able to identify and process domain-related information, which is the problem that current vertical search engines face.

Web page acquisition systems are mainly used in fields such as data acquisition, script manipulation, brand monitoring, price monitoring, portal news acquisition, industry information acquisition, competitive intelligence gathering, business data integration, market research, database marketing, and big data analysis. Because the information content and data structures of different fields may differ considerably, or even be mutually incompatible, the acquisition method should be targeted so that it can better adapt to the characteristics of each field.

Summary of the invention

The purpose of the present invention is to provide a web page acquisition method with high acquisition accuracy, in which page operations can be configured by script, and which supports precise analysis, precise storage, data display, application extension, and so on. The technical solution of the present invention is as follows:

An internet web page acquisition method, comprising web page target positioning, process configuration, and starting acquisition, characterized in that it comprises the following steps:

a) calling a background script library method to carry out target positioning;

b) calling an automated execution script method through configuration;

c) storing the acquired data.
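The three steps can be read as a small pipeline. The following is a minimal Python sketch under stated assumptions: the function names `locate`, `run_actions`, `store`, and `acquire`, the dictionary used as a page model, and the idea of modelling configured operations as callables are all hypothetical and are not the script library described by the invention.

```python
import json


def locate(page, wanted):
    """Step a): pick the wanted fields out of the (hypothetical) page model."""
    return {name: page[name] for name in wanted if name in page}


def run_actions(record, actions):
    """Step b): apply the configured operations in the configured order."""
    for action in actions:
        record = action(record)
    return record


def store(record):
    """Step c): serialize the acquired record (a JSON string in this sketch)."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)


def acquire(page, wanted, actions):
    """Run steps a), b), and c) in sequence."""
    return store(run_actions(locate(page, wanted), actions))


def strip_all(record):
    """Example configured operation: trim whitespace from every value."""
    return {key: value.strip() for key, value in record.items()}


# Hypothetical page model: field name -> raw text.
page = {"title": " Blue Widget ", "price": "19.90", "banner": "Sale!"}
print(acquire(page, ["title", "price"], [strip_all]))
```

Keeping the three steps as separate functions mirrors the separation the method describes: positioning, configured execution, and storage can each be changed without touching the others.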

Further, in step a), the target page is located by calling the script library encapsulated in the background, which includes methods for obtaining accurate data such as precise positioning, fuzzy positioning, and combined positioning.
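The three positioning modes can be illustrated with a short Python sketch. It assumes a simplified page model (a list of elements with an `id` and visible `text`) and uses `difflib.SequenceMatcher` as one possible similarity measure; the function names and the 0.8 threshold are hypothetical choices, not details taken from the invention.

```python
from difflib import SequenceMatcher

# Hypothetical page model: each element has an id and visible text.
ELEMENTS = [
    {"id": "price-main", "text": "Price: 19.90"},
    {"id": "price-old", "text": "Old price: 25.00"},
    {"id": "title", "text": "Blue Widget"},
]


def locate_exact(elements, element_id):
    """Precise positioning: the id must match exactly."""
    return [e for e in elements if e["id"] == element_id]


def locate_fuzzy(elements, text, threshold=0.8):
    """Fuzzy positioning: keep elements whose text is similar enough."""
    def score(element):
        return SequenceMatcher(None, text.lower(), element["text"].lower()).ratio()
    return [e for e in elements if score(e) >= threshold]


def locate_combined(elements, id_prefix, text, threshold=0.8):
    """Combined positioning: an id-prefix filter followed by a fuzzy text match."""
    candidates = [e for e in elements if e["id"].startswith(id_prefix)]
    return locate_fuzzy(candidates, text, threshold)
```

For example, `locate_fuzzy(ELEMENTS, "blue widgets")` still finds the `title` element even though the query text is not an exact match, while `locate_combined` narrows the candidates by id before comparing text.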

Further, in step b), by calling the script library encapsulated in the background and following the user's custom configuration, the program can automatically execute operations on the page; through configuration, the program can carry out automated operations at a specified time or during a specified period.
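The two timing options (a specified time, or a specified period) could be checked as in the sketch below. The helper names `due_at` and `in_window` are hypothetical, and the daily-window interpretation of "period" is an assumption made for illustration.

```python
from datetime import datetime, time
from typing import Optional


def due_at(now: datetime, run_at: datetime, last_run: Optional[datetime]) -> bool:
    """Specified time: true once run_at is reached and no run has happened since."""
    return now >= run_at and (last_run is None or last_run < run_at)


def in_window(now: datetime, start: time, end: time) -> bool:
    """Specified period: true if now falls inside the daily window [start, end)."""
    current = now.time()
    if start <= end:
        return start <= current < end
    # The window wraps past midnight, for example 22:00 to 02:00.
    return current >= start or current < end
```

A polling loop or an external scheduler would call these checks and trigger the configured page operations when they return true.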

Further, in step c), by executing the background script library method, the data that needs to be stored is analyzed and stored.

Further, in step c), the storage may be a local or remote database, or may be file-based, for example Excel, JSON, TXT, or XML, for local or off-site storage.
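File-based storage in several of the listed formats can be sketched with the Python standard library alone. The function name `save_records` and the file names are hypothetical; the sketch assumes record keys are valid XML element names, and it omits Excel output because writing a real workbook would require a third-party library.

```python
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def save_records(records, out_dir):
    """Write the same records as JSON, tab-separated TXT, and XML files."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    # Use the union of all keys so records with missing fields still line up.
    fields = sorted({key for record in records for key in record})

    # JSON: the records as-is.
    text = json.dumps(records, ensure_ascii=False, indent=2)
    (out / "records.json").write_text(text, encoding="utf-8")

    # TXT: one tab-separated line per record, missing fields left empty.
    lines = ["\t".join(str(r.get(f, "")) for f in fields) for r in records]
    (out / "records.txt").write_text("\n".join(lines), encoding="utf-8")

    # XML: one <record> element per record, one child element per field.
    root = ET.Element("records")
    for record in records:
        item = ET.SubElement(root, "record")
        for field in fields:
            ET.SubElement(item, field).text = str(record.get(field, ""))
    ET.ElementTree(root).write(
        str(out / "records.xml"), encoding="utf-8", xml_declaration=True
    )
    return fields
```

Pointing `out_dir` at a mounted network share would be one simple way to approximate off-site storage with the same code.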

Further, the information automatically captured after positioning on the target website may be text information, picture information, table information, menu information, tree control information, and so on.

Further, the acquired information can be classified in a user-defined manner.
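User-defined classification could be as simple as a keyword rule table supplied by the user. The sketch below is one hypothetical way to do it; the `classify` function, the rule format, and the choice of the `title` field are assumptions for illustration.

```python
def classify(record, rules, default="uncategorized"):
    """Return the first category whose keywords appear in the record's title."""
    text = record.get("title", "").lower()
    for category, keywords in rules.items():
        if any(keyword in text for keyword in keywords):
            return category
    return default


# Hypothetical user-defined rules: category -> keywords (lowercase).
RULES = {
    "news": ["announces", "report"],
    "pricing": ["price", "discount"],
}
```

Because dictionaries preserve insertion order, the order in which the user lists the categories decides which rule wins when several match.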

Further, target information can be acquired periodically through the Windows Task Scheduler.

Further, asynchronous requests can be intercepted to ensure data accuracy.

Further, data extraction and data merging across multiple pages are supported.
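Merging data extracted from several pages usually comes down to joining records on a shared key. The sketch below assumes each page yields a list of dictionaries and that a field such as `sku` identifies the same item across pages; both the function name `merge_pages` and the key are hypothetical.

```python
def merge_pages(pages, key):
    """Merge records from several pages, joining on the value of `key`."""
    merged = {}
    for page in pages:
        for record in page:
            # Later pages add fields to, or overwrite fields of, earlier ones.
            merged.setdefault(record[key], {}).update(record)
    return list(merged.values())


# Hypothetical data: a list page and a detail page describing the same items.
list_page = [{"sku": "1", "title": "Blue Widget"}, {"sku": "2", "title": "Red Widget"}]
detail_page = [{"sku": "1", "price": "19.90"}]
```

Here `merge_pages([list_page, detail_page], "sku")` yields one record per `sku`, with the price from the detail page attached to the matching item from the list page.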

Further, automatic paging while acquiring specified table information is supported, as is acquisition from multiple tables.
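Automatic paging over a table can be sketched with the standard-library HTML parser. The sketch assumes a caller-supplied `fetch_page(page_no)` function that returns the HTML of a given page (hypothetical; a real system would drive a browser or issue requests) and treats an empty page as the end of the data.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> on a page, across all tables."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def collect_table(fetch_page, columns, max_pages=100):
    """Walk pages 1, 2, ... until a page yields no rows; keep only `columns`."""
    collected = []
    for page_no in range(1, max_pages + 1):
        parser = TableRows()
        parser.feed(fetch_page(page_no))
        if not parser.rows:
            break
        collected.extend([row[i] for i in columns] for row in parser.rows)
    return collected
```

The `columns` argument stands in for the "specified table information": only the listed column indexes are kept from each row.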

Further, an automatic browsing function is supported.

Further, performing active operations on the page through script configuration is supported.

Further, user-defined data formats are supported.

Further, after acquisition, data can be written directly into a database across domains, reducing coupling.

Further, the following databases are supported: SQL Server, Oracle, DB2, MySQL, Sybase, MS Access, etc.
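Writing acquired records straight into a database can be illustrated with Python's built-in `sqlite3` module. SQLite is used here only as a stand-in, since the databases listed above each need their own driver; the table name, column names, and `store_records` helper are hypothetical, and `INSERT OR REPLACE` is SQLite-specific syntax.

```python
import sqlite3


def store_records(conn, records):
    """Insert acquired records, replacing rows that share the same URL."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS acquired ("
        "url TEXT PRIMARY KEY, title TEXT, price TEXT)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO acquired (url, title, price) "
        "VALUES (:url, :title, :price)",
        records,
    )
    conn.commit()


# An in-memory database keeps the sketch self-contained.
conn = sqlite3.connect(":memory:")
store_records(conn, [{"url": "site1/item/1", "title": "Blue Widget", "price": "19.90"}])
store_records(conn, [{"url": "site1/item/1", "title": "Blue Widget", "price": "17.50"}])
```

Keying rows on the URL means a repeated acquisition updates the stored record instead of duplicating it.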

The beneficial effects of the present invention are as follows: the user can configure custom tasks to accurately acquire, in bulk, semi-structured and unstructured data from target internet web pages and convert it into structured records, which can be published externally or used internally, achieving rapid acquisition of external information and automated, scheduled manipulation of web pages.

Brief description of the drawings

Fig. 1 is a flow diagram of the method of the present invention.

Fig. 2 is a flow diagram of the embodiment.

Specific embodiment

In order to make the technical means, creative features, objectives, and effects of the present invention easy to understand, the invention is further described below with reference to specific embodiments and the drawings. The following embodiments are merely preferred embodiments of the invention, not all of them. Other embodiments obtained by those skilled in the art on the basis of these embodiments without creative work fall within the protection scope of the present invention.

This embodiment acquires data from multiple websites. The specific implementation steps are:

First step: by calling the script library encapsulated in the background, the target pages (website 1, website 2, website 3, etc.) are located, using precise positioning, fuzzy positioning, and combined positioning;

Second step: by calling the script library encapsulated in the background and following the user's custom configuration, the program automatically executes operations on the page; through configuration, the program can carry out automated operations at a specified time or during a specified period;

Further, in step c), the storage may be a local or remote database, or may be file-based, for example Excel, JSON, TXT, or XML, for local or off-site storage.

Further, the acquired information can be classified in a user-defined manner.

Further, asynchronous requests can be intercepted to ensure data accuracy.

Further, data extraction and data merging across multiple pages are supported.

Further, an automatic browsing function is supported.

Further, performing active operations on the page through script configuration is supported.

Further, user-defined data formats are supported.

Further, the following databases are supported: SQL Server, Oracle, DB2, MySQL, Sybase, MS Access, etc.

Then, by executing the background script library method, the data that needs to be stored is analyzed and stored. The storage may be a local or remote database, or may be file-based, for example Excel, JSON, TXT, or XML, for local or off-site storage.

Claims

1. An internet web page acquisition method, comprising web page target positioning, process configuration, and starting acquisition, characterized in that it comprises the following steps:

a) calling a background script library method to carry out target positioning;

b) calling an automated execution script method through configuration;

c) storing the acquired data.

2. The internet web page acquisition method according to claim 1, characterized in that: in step a), the target page is located by calling the script library encapsulated in the background, which includes methods for obtaining accurate data such as precise positioning, fuzzy positioning, and combined positioning.

3. The internet web page acquisition method according to claim 1, characterized in that: in step b), by calling the script library encapsulated in the background and following the user's custom configuration, the program can automatically execute operations on the page; through configuration, the program can carry out automated operations at a specified time or during a specified period.

4. The internet web page acquisition method according to claim 1, characterized in that: in step c), by executing the background script library method, the data that needs to be stored is analyzed and stored.

5. The internet web page acquisition method according to claim 4, characterized in that: in step c), the storage may be a local or remote database, or may be file-based, for example Excel, JSON, TXT, or XML, for local or off-site storage.