CN110069682A - An internet web page acquisition method - Google Patents

An internet web page acquisition method

Info

Publication number
CN110069682A
CN110069682A CN201710822007.5A CN201710822007A CN110069682A CN 110069682 A CN110069682 A CN 110069682A CN 201710822007 A CN201710822007 A CN 201710822007A CN 110069682 A CN110069682 A CN 110069682A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710822007.5A
Other languages
Chinese (zh)
Inventor
梁威
谢宏亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changsha Biovision Software Technology Co Ltd
Original Assignee
Changsha Biovision Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changsha Biovision Software Technology Co Ltd
Priority to CN201710822007.5A
Publication of CN110069682A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention provides an internet web page acquisition method comprising web page target positioning, process configuration, and starting acquisition. The method comprises the following steps: a) calling a background script library method to position the target; b) configuring and calling an automated script execution method; c) storing the acquired data. The invention supports automated and semi-automated data acquisition and data monitoring, and can serve big data analysis such as the computation and analysis of flowing (dynamic) data and of static data.

Description

An internet web page acquisition method
Technical field
The present invention relates to the internet field, and in particular to an internet web page acquisition method.
Background
Today's society is developing at high speed. Science and technology are flourishing, information flows freely, communication between people is ever closer, and life is more and more convenient. Big data is a product of this network age.
With the appearance of more and more internet applications, people no longer focus solely on the quantity of information obtained from the internet and instead care more about its quality. This new user demand for information led to the birth of the third generation of search engines, with vertical search engine technology at its core. A vertical search engine analyzes web pages differently from a second-generation search engine. When a second-generation search engine analyzes a web page, its main goal is keyword extraction, and it is not concerned with the meaning of the page content itself. The information collection target of a vertical search engine is domain-dependent information, so when analyzing a web page it must not only pay attention to the keywords in the page but also "understand" the meaning of the page content within the specific domain. A vertical search engine is therefore required to identify and process domain-relevant information, and this is the problem that vertical search engines currently face.
Web page acquisition systems are mainly used in fields such as data acquisition, script manipulation, brand monitoring, price monitoring, portal news acquisition, industry information acquisition, competitive intelligence gathering, business data integration, market research, database marketing, and big data analysis. Because the information content and data structures of different fields can differ considerably and may even be mutually incompatible, the data acquisition approach should be tailored to its target in order to adapt well to each field.
Summary of the invention
The purpose of the present invention is to provide an internet web page acquisition method with high acquisition accuracy, in which page operations can be configured by script, and which supports precise analysis, precise storage, data display, application extension, and the like. The technical solution of the present invention is as follows:
An internet web page acquisition method, comprising web page target positioning, process configuration, and starting acquisition, characterized by comprising the following steps:
a) calling a background script library method to position the target;
b) configuring and calling an automated script execution method;
c) storing the acquired data.
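The three steps above can be pictured as a small pipeline. The following is only a minimal sketch under stated assumptions, not the patent's script library: the function names (`locate_targets`, `run_configured_operations`, `store`, `acquire`) are hypothetical, the page is passed in as an HTML string, and "positioning" is reduced to picking out `<td>` cells with a regular expression.

```python
import re
from typing import Callable


def locate_targets(page: str) -> list[str]:
    """Step a) (illustrative): return the raw text of every <td> cell."""
    return re.findall(r"<td>(.*?)</td>", page, flags=re.S)


def run_configured_operations(
    items: list[str], operations: list[Callable[[str], str]]
) -> list[str]:
    """Step b) (illustrative): apply each user-configured operation in order."""
    for operation in operations:
        items = [operation(item) for item in items]
    return items


def store(records: list[str], sink: list[str]) -> None:
    """Step c) (illustrative): append the records to an in-memory sink."""
    sink.extend(records)


def acquire(
    page: str, operations: list[Callable[[str], str]], sink: list[str]
) -> list[str]:
    """Run steps a), b), and c) in sequence and return the stored records."""
    targets = locate_targets(page)
    processed = run_configured_operations(targets, operations)
    store(processed, sink)
    return processed
```

For example, `acquire(page, [str.strip, str.upper], sink)` would locate the cells, trim and upper-case them, and append the result to `sink`. A real system would replace each function with calls into the encapsulated script library.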
Further, in step a), the target page is positioned by calling the script library encapsulated in the background; the available methods for obtaining precise data include exact positioning, fuzzy positioning, combined positioning, and the like.
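The source does not define how exact, fuzzy, and combined positioning work internally. One plausible reading, sketched here with Python's standard `html.parser`, is that exact positioning matches an attribute value in full, fuzzy positioning matches a fragment of it, and combined positioning requires several conditions at once. All names below (`ElementCollector`, `parse_elements`, `locate_exact`, `locate_fuzzy`, `locate_combined`) are hypothetical, and the sketch assumes reasonably well-formed HTML.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect every element as a dict with its tag, attributes, and direct text."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._stack = []

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(element)
        self._stack.append(element)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, skipping unclosed elements.
        while self._stack:
            if self._stack.pop()["tag"] == tag:
                break

    def handle_data(self, data):
        if self._stack:
            self._stack[-1]["text"] += data.strip()


def parse_elements(html):
    collector = ElementCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


def locate_exact(elements, attr, value):
    """Exact positioning: the attribute must equal the value."""
    return [e for e in elements if e["attrs"].get(attr) == value]


def locate_fuzzy(elements, attr, fragment):
    """Fuzzy positioning: the attribute only has to contain the fragment."""
    return [e for e in elements if fragment in (e["attrs"].get(attr) or "")]


def locate_combined(elements, *predicates):
    """Combined positioning: every predicate must hold for the element."""
    return [e for e in elements if all(p(e) for p in predicates)]
```

A combined lookup might then read `locate_combined(elements, lambda e: e["tag"] == "span", lambda e: "old" in e["attrs"].get("class", ""))`.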
Further, in step b), by calling the script library encapsulated in the background, the program automatically executes operations on the page according to the user's custom configuration; through configuration, the program can perform automated operations at a specified time or within a specified time period.
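The "specified time or time period" condition can be illustrated with a small check that a scheduler loop might call before running the configured operations. This is a sketch only; `is_due` and `in_window` are hypothetical names, and matching a fixed time at minute granularity is an assumption, not something the source specifies.

```python
from datetime import datetime, time
from typing import Optional, Tuple


def in_window(now: datetime, start: time, end: time) -> bool:
    """Return True if `now` falls inside the window, including overnight windows."""
    current = now.time()
    if start <= end:
        return start <= current <= end
    # The window wraps past midnight, e.g. 22:00 to 06:00.
    return current >= start or current <= end


def is_due(
    now: datetime,
    run_at: Optional[time] = None,
    window: Optional[Tuple[time, time]] = None,
) -> bool:
    """Decide whether the automated operation should run at `now`.

    `run_at` models "at a specified time" (matched to the minute);
    `window` models "within a specified time period".
    With neither configured, the operation is always allowed to run.
    """
    if run_at is not None:
        return (now.hour, now.minute) == (run_at.hour, run_at.minute)
    if window is not None:
        return in_window(now, *window)
    return True
```

Keeping the decision in a pure function of `now` makes the schedule easy to test without waiting for real time to pass.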
Further, in step c), the data that needs to be stored is analyzed and stored by executing a background script library method.
Further, in step c), the data may be stored in a local or remote database, or in files such as Excel, JSON, TXT, or XML, either locally or at a remote location.
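File-based storage in several of the formats named above can be sketched with the Python standard library alone. The function name `save_records` and the output file name `records.<fmt>` are hypothetical. Native Excel files are not covered here because the standard library has no writer for them; CSV is used as a stand-in that spreadsheet programs can open. The sketch assumes a non-empty list of flat dictionaries whose keys are valid XML element names.

```python
import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def save_records(records, directory, fmt):
    """Write a list of flat dicts to `directory` in the given format."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    path = directory / f"records.{fmt}"
    if fmt == "json":
        text = json.dumps(records, ensure_ascii=False, indent=2)
        path.write_text(text, encoding="utf-8")
    elif fmt == "txt":
        # One record per line, values separated by tabs.
        lines = ["\t".join(str(v) for v in r.values()) for r in records]
        path.write_text("\n".join(lines), encoding="utf-8")
    elif fmt == "xml":
        root = ET.Element("records")
        for record in records:
            item = ET.SubElement(root, "record")
            for key, value in record.items():
                ET.SubElement(item, key).text = str(value)
        ET.ElementTree(root).write(str(path), encoding="utf-8", xml_declaration=True)
    elif fmt == "csv":
        # CSV stands in for Excel in this sketch.
        with path.open("w", newline="", encoding="utf-8") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(records[0].keys()))
            writer.writeheader()
            writer.writerows(records)
    else:
        raise ValueError(f"unsupported format: {fmt}")
    return path
```

Remote storage would differ only in where `directory` points, for instance a mounted network share.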
Further, the information automatically captured after positioning on the target website may be text, pictures, tables, menus, tree controls, and the like.
Further, the acquired information can be classified in a user-defined manner.
Further, target information can be acquired periodically through the Windows Task Scheduler.
Further, asynchronous requests can be intercepted to ensure data accuracy.
Further, data extraction across multiple pages and merging of the extracted data are supported.
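Merging data extracted from several pages usually comes down to joining records on a shared key, for example a list page that supplies titles and a detail page that supplies prices. The source does not describe its merge rule, so the following is an assumed approach: records with the same key value are combined, and later pages overwrite fields that earlier pages already set. The name `merge_pages` is hypothetical.

```python
def merge_pages(pages, key):
    """Merge per-page record lists into one record per distinct key value.

    `pages` is a list of record lists (one list per source page);
    every record is a dict that contains `key`.
    """
    merged = {}
    for records in pages:
        for record in records:
            # Create the merged record on first sight, then fold fields in.
            merged.setdefault(record[key], {}).update(record)
    return list(merged.values())
```

Because dictionaries preserve insertion order, the merged records come back in the order their keys were first seen.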
Further, automatic paging is supported when acquiring specified information from tables, and acquisition from multiple tables is supported.
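Automatic paging over a table can be sketched as a loop that requests page 1, 2, 3, and so on, and stops at the first page that yields no rows. In this sketch `fetch_page` is a hypothetical callable supplied by the caller that returns the HTML of a page number (or `None` when the page does not exist); how pages are actually fetched is outside the source. Row extraction uses the standard `html.parser`.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the cell text of every <tr> as a list of strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and self._row:
            self._row[-1] += data.strip()


def parse_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


def acquire_all_pages(fetch_page, max_pages=100):
    """Walk pages 1..max_pages and stop at the first page without rows."""
    rows = []
    for number in range(1, max_pages + 1):
        html = fetch_page(number)
        page_rows = parse_rows(html) if html else []
        if not page_rows:
            break
        rows.extend(page_rows)
    return rows
```

The `max_pages` cap is a safety limit so that a site that never returns an empty page cannot keep the loop running indefinitely.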
Further, an automatic browsing function is supported.
Further, active page operations configured by script are supported.
Further, user-defined data formats are supported.
Further, after acquisition the data can be written directly into a database across domains, which reduces coupling.
Further, mainstream databases are supported: SQL Server, Oracle, DB2, MySQL, Sybase, MS Access, and the like.
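Writing acquired records into a relational database can be sketched as follows. SQLite (Python's built-in `sqlite3` module) is used purely as a stand-in so the example runs without a server; it is not one of the databases the source lists, and connecting to SQL Server, Oracle, DB2, MySQL, Sybase, or MS Access would require the corresponding driver. The function name `store_rows` is hypothetical, and table and column names are assumed to come from trusted configuration, since they are interpolated into the SQL text.

```python
import sqlite3


def store_rows(connection, table, rows):
    """Create the table if needed and insert a non-empty list of flat dicts."""
    columns = list(rows[0].keys())
    column_sql = ", ".join(f'"{c}"' for c in columns)
    placeholders = ", ".join("?" for _ in columns)
    connection.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_sql})')
    # Values are passed as parameters, never interpolated into the SQL text.
    connection.executemany(
        f'INSERT INTO "{table}" ({column_sql}) VALUES ({placeholders})',
        [tuple(row[c] for c in columns) for row in rows],
    )
    connection.commit()
```

Keeping the storage step behind a single function like this is one way to obtain the low coupling mentioned above: the acquisition code only hands over records and never needs to know which database sits behind the connection.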
The beneficial effects of the present invention are as follows: according to custom task configurations, the user can precisely acquire semi-structured and unstructured data from target internet web pages in bulk and convert it into structured records, which can be published externally or used internally. This enables fast acquisition of external information and automated, scheduled manipulation of web pages.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the present invention.
Fig. 2 is a flow diagram of the embodiment.
Detailed description of the embodiments
To make the technical means, inventive features, objectives, and effects of the present invention easy to understand, the invention is further explained below with reference to specific embodiments and the drawings. The following embodiments are only preferred embodiments of the present invention, not all of them. Other embodiments obtained by those skilled in the art on the basis of these embodiments without creative work fall within the protection scope of the present invention.
In this embodiment, data is acquired from multiple websites. The specific implementation steps are as follows:
First step: by calling the script library encapsulated in the background, the target pages (website 1, website 2, website 3, etc.) are positioned, using exact positioning, fuzzy positioning, and combined positioning.
Second step: by calling the script library encapsulated in the background, the program automatically executes operations on the page according to the user's custom configuration; through configuration, the program can perform automated operations at a specified time or within a specified time period.
Further, in step c), the data may be stored in a local or remote database, or in files such as Excel, JSON, TXT, or XML, either locally or at a remote location.
Further, the information automatically captured after positioning on the target website may be text, pictures, tables, menus, tree controls, and the like.
Further, the acquired information can be classified in a user-defined manner.
Further, target information can be acquired periodically through the Windows Task Scheduler.
Further, asynchronous requests can be intercepted to ensure data accuracy.
Further, data extraction across multiple pages and merging of the extracted data are supported.
Further, automatic paging is supported when acquiring specified information from tables, and acquisition from multiple tables is supported.
Further, an automatic browsing function is supported.
Further, active page operations configured by script are supported.
Further, user-defined data formats are supported.
Further, after acquisition the data can be written directly into a database across domains, which reduces coupling.
Further, mainstream databases are supported: SQL Server, Oracle, DB2, MySQL, Sybase, MS Access, and the like.
Then, by executing a background script library method, the data that needs to be stored is analyzed and stored. The data may be stored in a local or remote database, or in files such as Excel, JSON, TXT, or XML, either locally or at a remote location.

Claims (5)

1. An internet web page acquisition method, comprising web page target positioning, process configuration, and starting acquisition, characterized by comprising the following steps:
a) calling a background script library method to position the target;
b) configuring and calling an automated script execution method;
c) storing the acquired data.
2. The internet web page acquisition method according to claim 1, characterized in that: in step a), the target page is positioned by calling the script library encapsulated in the background, and the methods for obtaining precise data include exact positioning, fuzzy positioning, combined positioning, and the like.
3. The internet web page acquisition method according to claim 1, characterized in that: in step b), by calling the script library encapsulated in the background, the program automatically executes operations on the page according to the user's custom configuration, and through configuration the program can perform automated operations at a specified time or within a specified time period.
4. The internet web page acquisition method according to claim 1, characterized in that: in step c), the data that needs to be stored is analyzed and stored by executing a background script library method.
5. The internet web page acquisition method according to claim 4, characterized in that: in step c), the data may be stored in a local or remote database, or in files such as Excel, JSON, TXT, or XML, either locally or at a remote location.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710822007.5A CN110069682A (en) 2017-09-14 2017-09-14 An internet web page acquisition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710822007.5A CN110069682A (en) 2017-09-14 2017-09-14 An internet web page acquisition method

Publications (1)

Publication Number Publication Date
CN110069682A (en) 2019-07-30

Family

ID=67364513

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710822007.5A Pending CN110069682A (en) 2017-09-14 2017-09-14 An internet web page acquisition method

Country Status (1)

Country Link
CN (1) CN110069682A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103455636A (en) * 2013-09-27 2013-12-18 浪潮齐鲁软件产业有限公司 Automatic capturing and intelligent analyzing method based on Internet tax data
CN104750812A (en) * 2015-03-30 2015-07-01 浪潮集团有限公司 Automatic data collecting method based on webpage label analysis
CN106959995A (en) * 2016-12-21 2017-07-18 四川长虹电器股份有限公司 Compatible two-way automatic web page contents acquisition method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190730