CN110909228A - Data extraction method based on web crawler mechanism - Google Patents

Publication number
CN110909228A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911144777.4A
Other languages
Chinese (zh)
Inventor
贺洪煜
房霆宸
赵一鸣
陈渊鸿
吴联定
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Construction Group Co Ltd
Original Assignee
Shanghai Construction Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-11-21
Filing date
2019-11-21
Publication date
2020-03-24
Application filed by Shanghai Construction Group Co Ltd
Priority to CN201911144777.4A
Publication of CN110909228A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention provides an active approach to data acquisition, namely a data extraction method based on a web crawler mechanism, which can quickly extract large amounts of data from existing monitoring platforms and quickly build an enterprise-level big data environment. The method comprises the following steps: step one, adding marks to the monitoring data in the human-computer interface of each heterogeneous system; and step two, capturing the monitoring data with a crawler algorithm according to the marks.

Description

Data extraction method based on web crawler mechanism
Technical Field
The invention belongs to the technical field of informatization, and particularly relates to a data extraction method based on a web crawler mechanism.
Background
Existing monitoring platforms differ in type, data acquisition mode, and monitoring depth, which makes it difficult to plan a single enterprise-level monitoring platform. For enterprise-level applications, what essentially matters is the monitoring data itself; the monitoring depth, page form, and access method of the existing platforms are secondary. At present, data is acquired mainly through APIs. This approach is passive: the data that can be obtained is limited to whatever APIs the monitoring platform provides, which makes data acquisition harder and narrows its scope. In addition, existing manually uploaded text and similar material can only be obtained after it has been converted into the corresponding data format.
Disclosure of Invention
The invention provides an active approach to data acquisition, namely a data extraction method based on a web crawler mechanism, which can quickly extract large amounts of data from existing monitoring platforms and quickly build an enterprise-level big data environment.
The technical solution of the invention is as follows:
A data extraction method based on a web crawler mechanism comprises the following steps:
step one, adding marks to the monitoring data in the human-computer interface of each heterogeneous system; the marks may be predefined, for example the mark for the temperature monitoring data of facility one of heterogeneous system one may be defined as "A_a_001_wd"; if the human-computer interface of the heterogeneous system is in HTML form, the mark can be set as the id of a <div> tag in the HTML code;
and step two, capturing the monitoring data with a crawler algorithm according to the marks.
In this method, the marks added to the monitoring data in the human-computer interface of each heterogeneous system give the crawler program explicit targets, so that data is acquired actively rather than passively. Automatic data extraction runs on all heterogeneous systems around the clock (24 hours a day), so large amounts of data can be extracted quickly from existing monitoring platforms and an enterprise-level big data environment can be built quickly.
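The two steps above can be illustrated with a minimal sketch, assuming the human-computer interface is an HTML page in which each monitored value sits inside a <div> whose id is the mark. The patent does not name a crawler library or algorithm; the class and function names, the sample page, and the second mark "A_a_002_wd" are hypothetical, and only Python's standard html.parser module is used.

```python
from html.parser import HTMLParser


class MarkedValueParser(HTMLParser):
    """Collect the text inside every <div> whose id is one of the wanted marks."""

    def __init__(self, marks):
        super().__init__()
        self.marks = set(marks)
        self.values = {}   # mark -> captured text
        self._stack = []   # one entry per open <div>: its mark, or None

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            div_id = dict(attrs).get("id")
            self._stack.append(div_id if div_id in self.marks else None)

    def handle_endtag(self, tag):
        if tag == "div" and self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Attribute the text to the innermost open marked <div>, if any.
        for mark in reversed(self._stack):
            if mark is not None:
                self.values[mark] = self.values.get(mark, "") + data
                break


def extract_marked(html, marks):
    """Step two: capture the values of the given marks from one HTML page."""
    parser = MarkedValueParser(marks)
    parser.feed(html)
    parser.close()
    return {mark: text.strip() for mark, text in parser.values.items()}


# Hypothetical page of heterogeneous system one (step one: ids carry the marks).
page = """
<html><body>
  <div id="header">Plant dashboard</div>
  <div id="A_a_001_wd">23.5</div>
  <div id="A_a_002_wd"><span>24.1</span></div>
</body></html>
"""
print(extract_marked(page, ["A_a_001_wd", "A_a_002_wd"]))
```

In a real deployment the page text would come from an HTTP request to the monitoring platform rather than from an inline string; that part is omitted here because the source does not describe it.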
Further, in the data extraction method based on the web crawler mechanism, step one also includes adding marks to the monitoring data in each document. Once the monitoring data in a document has been marked, the crawler algorithm can capture it and merge it with the monitoring data from the human-computer interfaces of the heterogeneous systems.
Further, in the data extraction method based on the web crawler mechanism, step one also includes grading the monitoring data, and step two also includes setting different capture cycles for monitoring data of different levels. Different monitoring data often need to be captured at different frequencies, so grading the monitoring data and assigning a different capture cycle to each level can effectively improve the efficiency of data acquisition. For example, the grading may be based on the nature of the data or on the importance of each heterogeneous system.
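A minimal sketch of level-dependent capture cycles follows. The level names, the cycle lengths in seconds, and the mark "B_b_001_sd" are assumptions made for illustration; the source only states that different levels receive different capture cycles. The function computes the order of capture events over a time horizon without performing any actual capture.

```python
import heapq

# Hypothetical levels and capture cycles in seconds (not from the source).
CAPTURE_CYCLE_SECONDS = {"critical": 10, "normal": 60, "low": 600}


def build_schedule(marks_by_level, horizon_seconds):
    """Return (time, mark) capture events up to the horizon, in time order."""
    queue = []
    for level, marks in marks_by_level.items():
        for mark in marks:
            # Every mark is first captured at time 0, then once per cycle.
            heapq.heappush(queue, (0, mark, CAPTURE_CYCLE_SECONDS[level]))
    events = []
    while queue:
        due, mark, cycle = heapq.heappop(queue)
        if due > horizon_seconds:
            continue  # past the horizon: drop this mark from the queue
        events.append((due, mark))
        heapq.heappush(queue, (due + cycle, mark, cycle))
    return events


schedule = build_schedule(
    {"critical": ["A_a_001_wd"], "normal": ["B_b_001_sd"]},
    horizon_seconds=30,
)
for due, mark in schedule:
    print(due, mark)
```

A priority queue keeps the next due capture at the front regardless of how many levels exist, which is one simple way to serve many marks with different cycles from a single crawler loop.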
Further, in the data extraction method based on the web crawler mechanism, step one also includes establishing a data schema definition for the monitoring data, and the method includes a step three: converting the data type and display format of the captured monitoring data through a data conversion system to generate a standard data format file that conforms to the data schema definition. Because the data schema definition is established according to the enterprise data standard, it standardizes how accessed information systems supply data and provides a standardized data format reference for future information system development.
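The conversion in step three might look like the sketch below, which turns raw display strings into typed values and writes them as XML. The SCHEMA mapping, the element and attribute names, and the sample display string are all hypothetical; the source does not describe the data conversion system's rules. The sketch does not validate the result against a schema, since Python's standard library has no XSD validator.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical schema: mark -> (field name, unit, Python type). Not from the source.
SCHEMA = {"A_a_001_wd": ("temperature", "celsius", float)}


def to_standard_xml(captured):
    """Convert raw captured display strings into a typed XML document."""
    root = ET.Element("monitoringData")
    for mark, raw in sorted(captured.items()):
        name, unit, cast = SCHEMA[mark]
        # Strip display decoration such as labels or units; keep the first number.
        match = re.search(r"-?\d+(?:\.\d+)?", raw)
        if match is None:
            continue  # skip values that cannot be parsed
        item = ET.SubElement(root, "item", mark=mark, name=name, unit=unit)
        item.text = str(cast(match.group()))
    return ET.tostring(root, encoding="unicode")


standard_xml = to_standard_xml({"A_a_001_wd": "Temp: 23.50 C"})
print(standard_xml)
```

Keeping the type and unit in one mapping per mark means a new heterogeneous system can be added by extending the mapping rather than changing the conversion code.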
Further, the data extraction method based on the web crawler mechanism includes a step four: reading the standard data format file into a computer system and, after program processing, storing it in a database.
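Step four can be sketched as follows, assuming the standard data format file is an XML document with one item element per captured value and that the database is SQLite. The element names, the table layout, and the function name are hypothetical; the source names neither a database product nor a file layout.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store_standard_file(xml_text, conn):
    """Parse a standard-format XML document and insert its items into SQLite."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS monitoring "
        "(mark TEXT, name TEXT, unit TEXT, value REAL)"
    )
    rows = [
        (item.get("mark"), item.get("name"), item.get("unit"), float(item.text))
        for item in ET.fromstring(xml_text).iter("item")
    ]
    with conn:  # commit on success, roll back on error
        conn.executemany("INSERT INTO monitoring VALUES (?, ?, ?, ?)", rows)
    return len(rows)


# Hypothetical standard data format file content.
xml_text = (
    "<monitoringData>"
    '<item mark="A_a_001_wd" name="temperature" unit="celsius">23.5</item>'
    "</monitoringData>"
)
conn = sqlite3.connect(":memory:")
store_standard_file(xml_text, conn)
print(conn.execute("SELECT mark, value FROM monitoring").fetchall())
```

An in-memory database is used only so the sketch runs without side effects; a file path or another database driver could be substituted.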
Further, in the data extraction method based on the web crawler mechanism, the data schema definition is specifically an XML Schema Definition, and the standard data format file is an XML file.
Further, in the data extraction method based on the web crawler mechanism, the human-computer interface of the heterogeneous system is specifically in HTML form, and the document is in Word, Excel, or PDF format.
Drawings
Fig. 1 is a schematic flow chart of a data extraction method based on a web crawler mechanism according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples. Advantages and features of the present invention will become apparent from the following description and from the claims. Note that the drawings are highly simplified and not drawn to precise scale; they serve only to help describe the embodiments of the present invention conveniently and clearly.
Example 1:
Referring to Fig. 1, the data extraction method based on a web crawler mechanism of this embodiment includes the following steps:
step one, adding marks to the monitoring data in the human-computer interface of each heterogeneous system; the marks may be predefined, for example the mark for the temperature monitoring data of facility one of heterogeneous system one may be defined as "A_a_001_wd"; if the human-computer interface of the heterogeneous system is in HTML form, the mark can be set as the id of a <div> tag in the HTML code;
and step two, capturing the monitoring data with a crawler algorithm according to the marks.
In this method, the marks added to the monitoring data in the human-computer interface of each heterogeneous system give the crawler program explicit targets, so that data is acquired actively rather than passively. Automatic data extraction runs on all heterogeneous systems around the clock (24 hours a day), so large amounts of data can be extracted quickly from existing monitoring platforms and an enterprise-level big data environment can be built quickly.
As a preferred embodiment, in the data extraction method based on the web crawler mechanism, step one also includes adding marks to the monitoring data in each document. Once the monitoring data in a document has been marked, the crawler algorithm can capture it and merge it with the monitoring data from the human-computer interfaces of the heterogeneous systems.
As a preferred embodiment, in the data extraction method based on the web crawler mechanism, step one also includes grading the monitoring data, and step two also includes setting different capture cycles for monitoring data of different levels. Different monitoring data often need to be captured at different frequencies, so grading the monitoring data and assigning a different capture cycle to each level can effectively improve the efficiency of data acquisition. For example, the grading may be based on the nature of the data or on the importance of each heterogeneous system.
As a preferred embodiment, in the data extraction method based on the web crawler mechanism, step one also includes establishing a data schema definition for the monitoring data, and the method includes a step three: converting the data type and display format of the captured monitoring data through a data conversion system to generate a standard data format file that conforms to the data schema definition. Because the data schema definition is established according to the enterprise data standard, it standardizes how accessed information systems supply data and provides a standardized data format reference for future information system development.
As a preferred embodiment, the data extraction method based on the web crawler mechanism includes a step four: reading the standard data format file into a computer system and, after program processing, storing it in a database.
As a preferred embodiment, in the data extraction method based on the web crawler mechanism, the data schema definition is specifically an XML Schema Definition, and the standard data format file is an XML file.
As a preferred embodiment, in the data extraction method based on the web crawler mechanism, the human-computer interface of the heterogeneous system is specifically in HTML form, and the document is in Word, Excel, or PDF format.
The above description covers only the preferred embodiments of the present invention and is not intended to limit its scope; any variations and modifications made by those skilled in the art on the basis of the above disclosure fall within the scope of the appended claims.

Claims (7)

1. A data extraction method based on a web crawler mechanism, characterized by comprising the following steps:
step one, adding marks to the monitoring data in the human-computer interface of each heterogeneous system;
and step two, capturing the monitoring data with a crawler algorithm according to the marks.
2. The data extraction method based on a web crawler mechanism as recited in claim 1, wherein step one further comprises adding marks to the monitoring data in each document.
3. The data extraction method based on a web crawler mechanism as recited in claim 1, wherein step one further comprises grading the monitoring data, and step two further comprises setting different capture cycles for monitoring data of different levels.
4. The data extraction method based on a web crawler mechanism as recited in claim 1, wherein step one further comprises establishing a data schema definition for the monitoring data; and the method further comprises step three: converting the data type and display format of the captured monitoring data through a data conversion system to generate a standard data format file that conforms to the data schema definition.
5. The data extraction method based on a web crawler mechanism as recited in claim 4, further comprising step four: reading the standard data format file into a computer system and, after program processing, storing it in a database.
6. The data extraction method based on a web crawler mechanism as recited in claim 4, wherein the data schema definition is an XML Schema Definition and the standard data format file is an XML file.
7. The data extraction method based on a web crawler mechanism as recited in claim 2, wherein the human-computer interface of the heterogeneous system is in HTML form, and the document is in Word, Excel, or PDF format.
CN201911144777.4A 2019-11-21 2019-11-21 Data extraction method based on web crawler mechanism Pending CN110909228A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911144777.4A CN110909228A (en) 2019-11-21 2019-11-21 Data extraction method based on web crawler mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911144777.4A CN110909228A (en) 2019-11-21 2019-11-21 Data extraction method based on web crawler mechanism

Publications (1)

Publication Number Publication Date
CN110909228A true CN110909228A (en) 2020-03-24

Family

ID=69816990

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911144777.4A Pending CN110909228A (en) 2019-11-21 2019-11-21 Data extraction method based on web crawler mechanism

Country Status (1)

Country Link
CN (1) CN110909228A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006065467A (en) * 2004-08-25 2006-03-09 Hitachi Ltd Device for creating data extraction definition information and method for creating data extraction definition information
US20120259833A1 (en) * 2011-04-11 2012-10-11 Vistaprint Technologies Limited Configurable web crawler
CN107402869A (en) * 2017-07-12 2017-11-28 东软集团股份有限公司 Collecting method, device and system
CN108520056A (en) * 2018-04-03 2018-09-11 北京京东金融科技控股有限公司 Business datum monitoring method and device, system, readable medium and electronic equipment
CN109101519A (en) * 2018-05-09 2018-12-28 广东辰宜信息科技有限公司 Information acquisition system and Heterogeneous Information emerging system
CN109325161A (en) * 2018-09-11 2019-02-12 五八有限公司 Public sentiment data grasping means, device, equipment and storage medium
CN110134841A (en) * 2018-02-09 2019-08-16 鼎复数据科技(北京)有限公司 The customized real-time method for obtaining website data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
何守财 等: "《数据库百科全书》", 上海交通大学出版社, pages: 517 - 535 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116975854A (en) * 2023-06-30 2023-10-31 国网吉林省电力有限公司辽源供电公司 Financial information intelligent storage supervision system and method based on big data
CN116975854B (en) * 2023-06-30 2024-01-23 国网吉林省电力有限公司辽源供电公司 Financial information intelligent storage supervision system and method based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination