CN110069682A - A kind of internet web page acquisition method - Google Patents
- Publication number
- CN110069682A (application CN201710822007.5A)
- Authority
- CN
- China
- Prior art keywords
- positioning
- web page
- data
- internet web
- page acquisition
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/13—File access structures, e.g. distributed indices
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The present invention provides an internet web page acquisition method comprising web page target positioning, process configuration, and starting acquisition, characterized in that it comprises the following steps: a) calling a back-end script library method to locate the target; b) configuring and calling a script method for automated execution; c) storing the acquired data. The invention can be used for automated and semi-automated data acquisition and data monitoring, in support of big data analysis such as the computation and analysis of moving data and of static data.
Description
Technical field
The present invention relates to the Internet field, and in particular to an internet web page acquisition method.
Background art
Society today is developing at high speed: science and technology are flourishing, information flows freely, exchanges between people grow ever closer, and life grows ever more convenient. Big data is a product of this network age.
With the emergence of more and more Internet applications, people no longer focus solely on the quantity of information obtained from the Internet and are increasingly concerned with its quality. Users' new demands on information led to the birth of the third generation of search engines, with vertical search engine technology at its core. A vertical search engine analyzes web pages differently from a second-generation search engine. When a second-generation search engine analyzes a web page, it mainly aims at keyword extraction and is not concerned with the meaning of the page content itself. The information collection target of a vertical search engine is domain-related information, so when analyzing a web page it must not only attend to the keywords in the page but also "understand" the meaning of the page content within a specific domain. This requires the vertical search engine to be able to identify and process domain-related information, which is the problem vertical search engines currently face.
Web page acquisition systems are mainly used in fields such as data acquisition, script manipulation, brand monitoring, price monitoring, portal news acquisition, industry information acquisition, competitive intelligence gathering, business data integration, market research, database marketing, and big data analysis. Because the information content and data structures of different fields may differ considerably, and may even be mutually incompatible, the data acquisition approach should be tailored to its target so that it can better adapt to the characteristics of each field.
Summary of the invention
The purpose of the present invention is to provide an internet web page acquisition method with high acquisition accuracy, in which page operations can be configured by script, and which provides precise analysis, precise storage, data display, application extension, and the like. The technical solution of the present invention is as follows:
An internet web page acquisition method, comprising web page target positioning, process configuration, and starting acquisition, characterized in that it comprises the following steps:
a) calling a back-end script library method to locate the target;
b) configuring and calling a script method for automated execution;
c) storing the acquired data.
Further, in step a), the target page is located by calling the script library encapsulated in the back end. The positioning includes methods for obtaining precise data, such as precise positioning, fuzzy positioning, and combined positioning.
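The three positioning modes named in step a) can be illustrated with a minimal Python sketch. The source does not define how each mode works, so the interpretation here is an assumption: precise positioning is taken as an exact `id` match, fuzzy positioning as a case-insensitive substring match on text or `class`, and combined positioning as both a tag match and a fuzzy match. The names `ElementCollector`, `locate_precise`, `locate_fuzzy`, `locate_combined`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class ElementCollector(HTMLParser):
    """Collect every element as a dict with its tag, attributes, and text."""

    def __init__(self):
        super().__init__()
        self.elements = []  # all elements seen, in document order
        self._open = []     # stack of currently open elements

    def handle_starttag(self, tag, attrs):
        element = {"tag": tag, "attrs": dict(attrs), "text": ""}
        self.elements.append(element)
        self._open.append(element)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates unclosed children.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i]["tag"] == tag:
                del self._open[i:]
                break

    def handle_data(self, data):
        # Attribute text to the innermost open element only.
        if self._open:
            self._open[-1]["text"] += data.strip()


def parse(html):
    collector = ElementCollector()
    collector.feed(html)
    return collector.elements


def locate_precise(elements, element_id):
    """Precise positioning (assumed): exact match on the id attribute."""
    return [e for e in elements if e["attrs"].get("id") == element_id]


def locate_fuzzy(elements, keyword):
    """Fuzzy positioning (assumed): substring match on text or class."""
    needle = keyword.lower()
    return [
        e for e in elements
        if needle in e["text"].lower()
        or needle in (e["attrs"].get("class") or "").lower()
    ]


def locate_combined(elements, tag, keyword):
    """Combined positioning (assumed): exact tag plus fuzzy keyword."""
    return [e for e in locate_fuzzy(elements, keyword) if e["tag"] == tag]


# Hypothetical sample page used only for illustration.
SAMPLE = (
    '<div id="main"><h1 class="title">Price list</h1>'
    '<span class="price-tag">19.90</span><p>Lowest price today</p></div>'
)
```

For example, `locate_combined(parse(SAMPLE), "span", "price")` narrows the fuzzy matches down to the single `span` element, which is the kind of refinement combined positioning is presumably meant to provide.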
Further, in step b), by calling the script library encapsulated in the back end, the program can automatically execute operations on the page according to the user's custom configuration, and can be configured to perform the automated operations at a specified time or within a specified period.
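One way to picture step b) is a configuration that lists page operations together with the period in which they may run. The sketch below is an illustration under assumptions, not the patented script library: the configuration keys (`window`, `actions`, `op`), the handler table, and the example URL are all hypothetical, and the handlers only describe each action rather than driving a real browser.

```python
from datetime import datetime, time

# Hypothetical user configuration: a daily time window plus ordered actions.
CONFIG = {
    "window": {"start": "02:00", "end": "04:30"},
    "actions": [
        {"op": "open", "url": "https://example.com/list"},
        {"op": "click", "target": "#next"},
        {"op": "extract", "target": "table.data"},
    ],
}


def in_window(now, window):
    """Return True if `now` falls inside the configured daily window."""
    start = time.fromisoformat(window["start"])
    end = time.fromisoformat(window["end"])
    current = now.time()
    if start <= end:
        return start <= current <= end
    # Window wraps past midnight, e.g. 23:00 to 01:00.
    return current >= start or current <= end


def run(config, now, handlers):
    """Dispatch each configured action to its handler if the window is open."""
    if not in_window(now, config["window"]):
        return []
    return [handlers[action["op"]](action) for action in config["actions"]]


# Stand-in handlers that only describe the action; a real system would
# operate on an actual page here.
HANDLERS = {
    "open": lambda a: f"open {a['url']}",
    "click": lambda a: f"click {a['target']}",
    "extract": lambda a: f"extract {a['target']}",
}
```

Keeping the operations in data rather than in code is what lets a user change the page workflow or the schedule without modifying the program.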
Further, in step c), the data to be stored is analyzed and stored by executing a back-end script library method.
Further, in step c), the storage may use a local or remote database, or files in formats such as Excel, JSON, TXT, or XML, stored locally or at a remote location.
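The storage options in step c) can be sketched with the Python standard library. This is a hedged illustration only: `sqlite3` stands in for whichever local or remote database is used, CSV stands in for an Excel-readable file (it is not a native Excel workbook), and the record fields `site`, `title`, and `price` are hypothetical.

```python
import csv
import io
import json
import sqlite3

# Hypothetical acquired records.
RECORDS = [
    {"site": "site1", "title": "Item A", "price": 19.9},
    {"site": "site2", "title": "Item B", "price": 5.0},
]


def store_sqlite(records, connection):
    """Insert records into a table; sqlite3 stands in for any SQL database."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS items (site TEXT, title TEXT, price REAL)"
    )
    connection.executemany(
        "INSERT INTO items (site, title, price) VALUES (:site, :title, :price)",
        records,
    )
    connection.commit()


def to_json(records):
    """Serialize records as JSON text, keeping non-ASCII characters readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)


def to_csv(records):
    """Serialize records as CSV text, which spreadsheet tools can open."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["site", "title", "price"])
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()
```

A caller would pass `sqlite3.connect(":memory:")` for a quick trial or a file path for a persistent local database; writing the JSON or CSV text to a local or remote path is left to the caller.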
Further, the information located on the target website and captured automatically may be text, images, tables, menus, tree controls, and so on.
Further, the acquired information can be classified in a user-defined manner.
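The source does not say how user-defined classification is carried out. A minimal sketch, assuming the user supplies categories as keyword lists, is shown below; the `RULES` table and the `classify` function are hypothetical.

```python
# Hypothetical user-defined categories mapped to keyword lists.
RULES = {
    "news": ["announcement", "report"],
    "pricing": ["price", "discount"],
}


def classify(text, rules, default="uncategorized"):
    """Return the first category whose keyword appears in the text."""
    lowered = text.lower()
    for category, keywords in rules.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return default
```

Because the rules are checked in insertion order, a user can express priority between overlapping categories simply by ordering them.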
Further, target information can be acquired at regular intervals through the Windows Task Scheduler.
Further, asynchronous requests can be intercepted to ensure data accuracy.
Further, data extraction across multiple pages and merging of the extracted data are supported.
Further, automatic pagination is supported when acquiring specified information from tables, and acquisition from multiple tables is supported.
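Automatic pagination and multi-page merging can be sketched as follows. This is an illustration under assumptions: the in-memory `PAGES` and `DETAILS` dictionaries stand in for real list and detail pages, `fetch` is any callable that returns a page's rows and its next link, and the shared `id` key is hypothetical.

```python
# Hypothetical in-memory "site": each page holds table rows and a next link.
PAGES = {
    "/list?page=1": {
        "rows": [{"id": 1, "name": "A"}, {"id": 2, "name": "B"}],
        "next": "/list?page=2",
    },
    "/list?page=2": {"rows": [{"id": 3, "name": "C"}], "next": None},
}
# Hypothetical detail-page fields keyed by the same id.
DETAILS = {1: {"price": 19.9}, 2: {"price": 5.0}, 3: {"price": 7.5}}


def collect_all_pages(fetch, start_url, max_pages=100):
    """Follow next links until none remain, with a cap against loops."""
    rows, url, visited = [], start_url, set()
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page = fetch(url)
        rows.extend(page["rows"])
        url = page["next"]
    return rows


def merge_by_key(rows, details, key="id"):
    """Merge list-page rows with detail-page fields sharing the same key."""
    return [{**row, **details.get(row[key], {})} for row in rows]
```

Calling `merge_by_key(collect_all_pages(PAGES.__getitem__, "/list?page=1"), DETAILS)` yields one structured record per table row, combining fields that were spread over several pages.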
Further, an automatic browsing function is supported.
Further, active page operations through script configuration are supported.
Further, custom data formats are supported.
Further, after acquisition the data can be written directly into a database across domains, reducing coupling.
Further, the following databases are supported: SQL Server, Oracle, DB2, MySQL, Sybase, MS Access, etc.
The beneficial effects of the present invention are: according to a custom task configuration, the user can precisely acquire semi-structured and unstructured data from target internet web pages in bulk and convert it into structured records, which can be published externally or used internally, achieving rapid acquisition of external information and automated, scheduled manipulation of web pages.
Brief description of the drawings
Fig. 1 is a flow diagram of the method of the present invention.
Fig. 2 is a flow diagram of the embodiment.
Specific embodiments
To make the technical means, creative features, objectives, and effects of the present invention easy to understand, the invention is further explained below with reference to specific embodiments and the drawings. The following embodiments are merely preferred embodiments of the present invention, not all of them. Other embodiments obtained by those skilled in the art on the basis of these embodiments without creative work fall within the protection scope of the present invention.
In this embodiment, data is acquired from multiple websites. The specific implementation steps are:
First step: by calling the script library encapsulated in the back end, the target pages (website 1, website 2, website 3, etc.) are located; the positioning includes precise positioning, fuzzy positioning, and combined positioning.
Second step: by calling the script library encapsulated in the back end, the program can automatically execute operations on the page according to the user's custom configuration, and can be configured to perform the automated operations at a specified time or within a specified period.
Further, in step c), the storage may use a local or remote database, or files in formats such as Excel, JSON, TXT, or XML, stored locally or at a remote location.
Further, the information located on the target website and captured automatically may be text, images, tables, menus, tree controls, and so on.
Further, the acquired information can be classified in a user-defined manner.
Further, target information can be acquired at regular intervals through the Windows Task Scheduler.
Further, asynchronous requests can be intercepted to ensure data accuracy.
Further, data extraction across multiple pages and merging of the extracted data are supported.
Further, automatic pagination is supported when acquiring specified information from tables, and acquisition from multiple tables is supported.
Further, an automatic browsing function is supported.
Further, active page operations through script configuration are supported.
Further, custom data formats are supported.
Further, after acquisition the data can be written directly into a database across domains, reducing coupling.
Further, the following databases are supported: SQL Server, Oracle, DB2, MySQL, Sybase, MS Access, etc.
Then, by executing a back-end script library method, the data to be stored is analyzed and stored. The storage may use a local or remote database, or files in formats such as Excel, JSON, TXT, or XML, stored locally or at a remote location.
Claims (5)
1. An internet web page acquisition method, comprising web page target positioning, process configuration, and starting acquisition, characterized in that it comprises the following steps:
a) calling a back-end script library method to locate the target;
b) configuring and calling a script method for automated execution;
c) storing the acquired data.
2. The internet web page acquisition method according to claim 1, characterized in that: in step a), the target page is located by calling the script library encapsulated in the back end, the positioning including methods for obtaining precise data such as precise positioning, fuzzy positioning, and combined positioning.
3. The internet web page acquisition method according to claim 1, characterized in that: in step b), by calling the script library encapsulated in the back end, the program can automatically execute operations on the page according to the user's custom configuration, and can be configured to perform the automated operations at a specified time or within a specified period.
4. The internet web page acquisition method according to claim 1, characterized in that: in step c), the data to be stored is analyzed and stored by executing a back-end script library method.
5. The internet web page acquisition method according to claim 4, characterized in that: in step c), the storage may use a local or remote database, or files in formats such as Excel, JSON, TXT, or XML, stored locally or at a remote location.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710822007.5A CN110069682A (en) | 2017-09-14 | 2017-09-14 | A kind of internet web page acquisition method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710822007.5A CN110069682A (en) | 2017-09-14 | 2017-09-14 | A kind of internet web page acquisition method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110069682A true CN110069682A (en) | 2019-07-30 |
Family
ID=67364513
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710822007.5A Pending CN110069682A (en) | 2017-09-14 | 2017-09-14 | A kind of internet web page acquisition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110069682A (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103455636A (en) * | 2013-09-27 | 2013-12-18 | 浪潮齐鲁软件产业有限公司 | Automatic capturing and intelligent analyzing method based on Internet tax data |
CN104750812A (en) * | 2015-03-30 | 2015-07-01 | 浪潮集团有限公司 | Automatic data collecting method based on webpage label analysis |
CN106959995A (en) * | 2016-12-21 | 2017-07-18 | 四川长虹电器股份有限公司 | Compatible two-way automatic web page contents acquisition method |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543086B (en) | Network data acquisition and display method oriented to multiple data sources | |
CN102930059B (en) | Method for designing focused crawler | |
Prelipcean et al. | MEILI: A travel diary collection, annotation and automation system | |
CN106126648B (en) | It is a kind of based on the distributed merchandise news crawler method redo log | |
CN108229810B (en) | Industry analysis system and method based on network information resources | |
CN106021583B (en) | Statistical method and system for page flow data | |
CN101676907A (en) | Method and system of directionally acquiring Internet resources | |
CN102542061B (en) | Intelligent product classification method | |
KR101801257B1 (en) | Text-Mining Application Technique for Productive Construction Document Management | |
CN103942335A (en) | Construction method of uninterrupted crawler system oriented to web page structure change | |
CN104391978A (en) | Method and device for storing and processing web pages of browsers | |
CN105468744A (en) | Big data platform for realizing tax public opinion analysis and full text retrieval | |
CN101441629A (en) | Automatic acquiring method of non-structured web page information | |
CN107194007A (en) | A kind of integrated management system of spacecraft isomery test data | |
CN103838796A (en) | Webpage structured information extraction method | |
CN104598536B (en) | A kind of distributed network information structuring processing method | |
CN106547749B (en) | Webpage data acquisition method and device | |
CN106055618A (en) | Data processing method based on web crawlers and structural storage | |
CN102306201A (en) | Method and system for analyzing webpage title | |
CN101819584B (en) | Light weight intelligent webpage content analysis method | |
CN108536700A (en) | A kind of method that nothing buries a collector journal | |
CN103177022A (en) | Method and device of malicious file search | |
CN104598570A (en) | Resource fetching method and device | |
CN103198078B (en) | A kind of internet news event report trend analysis and system | |
CN107086925B (en) | Deep learning-based internet traffic big data analysis method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190730 |