CN110909228A - Data extraction method based on web crawler mechanism

Info

Publication number: CN110909228A
Application number: CN201911144777.4A
Authority: CN
Inventors: 贺洪煜; 房霆宸; 赵一鸣; 陈渊鸿; 吴联定
Original assignee: Shanghai Construction Group Co Ltd
Current assignee: Shanghai Construction Group Co Ltd
Priority date: 2019-11-21
Filing date: 2019-11-21
Publication date: 2020-03-24

Abstract

The invention provides an approach for actively acquiring data, namely a data extraction method based on a web crawler mechanism, which can quickly extract a large amount of data from existing monitoring platforms and quickly build an enterprise-level big data environment. The data extraction method based on the web crawler mechanism comprises the following steps: step one, adding marks to the monitoring data in the human-machine interface of each heterogeneous system; and step two, capturing the monitoring data with a crawler algorithm according to the marks.

Description

Data extraction method based on web crawler mechanism

Technical Field

The invention belongs to the technical field of informatization, and particularly relates to a data extraction method based on a web crawler mechanism.

Background

Existing monitoring platforms differ in type, data acquisition mode and monitoring depth, which makes it difficult to plan an integrated enterprise-level monitoring platform. For enterprise-level applications, what matters is the monitoring data itself; the monitoring depth, page form, access method and the like of the existing monitoring platforms are of no concern. At present, data is acquired mainly through APIs. This is a passive approach, however: only the data exposed by the APIs that a monitoring platform provides can be obtained, which implicitly increases the difficulty of data acquisition and limits its scope. In addition, existing manually uploaded text and similar material can be used only after being converted into the corresponding data format.

Disclosure of Invention

The invention provides an approach for actively acquiring data, namely a data extraction method based on a web crawler mechanism, which can quickly extract a large amount of data from existing monitoring platforms and quickly build an enterprise-level big data environment.

The technical scheme of the data extraction method based on the web crawler mechanism is as follows:

A data extraction method based on a web crawler mechanism comprises the following steps:

Step one, adding marks to the monitoring data in the human-machine interface of each heterogeneous system. The marks may be predefined; for example, the mark of the temperature monitoring data of facility one of heterogeneous system one may be defined as "A_a_001_wd". If the human-machine interface of the heterogeneous system is in HTML form, the mark can be set as the id of a <div> tag in the HTML code.

Step two, capturing the monitoring data with a crawler algorithm according to the marks.
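The two steps can be illustrated with a minimal sketch, assuming the human-machine interface is an HTML page whose monitoring values sit inside <div> tags carrying the marks as ids. The parser class, the `capture` function, the sample page and the second mark "A_a_002_wd" are hypothetical names chosen for illustration; fetching the page over the network is left out, and the patent does not prescribe a particular crawler algorithm.

```python
from html.parser import HTMLParser


class MarkedValueParser(HTMLParser):
    """Collect the text inside every <div> whose id is a known mark."""

    def __init__(self, marks):
        super().__init__()
        self.marks = set(marks)
        self.values = {}
        self._current = None  # mark of the <div> currently being read
        self._depth = 0       # nesting depth of <div>s inside the marked one

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._current is not None:
            self._depth += 1  # nested <div> inside a marked <div>
            return
        div_id = dict(attrs).get("id")
        if div_id in self.marks:
            self._current = div_id
            self._depth = 0
            self.values[div_id] = ""

    def handle_endtag(self, tag):
        if tag != "div" or self._current is None:
            return
        if self._depth == 0:
            # The marked <div> is closed: finalize its text.
            self.values[self._current] = self.values[self._current].strip()
            self._current = None
        else:
            self._depth -= 1

    def handle_data(self, data):
        if self._current is not None:
            self.values[self._current] += data


def capture(html, marks):
    """Return a mark -> displayed text mapping for the marked elements."""
    parser = MarkedValueParser(marks)
    parser.feed(html)
    parser.close()
    return parser.values


# Hypothetical page: two marked readings and one unmarked element.
PAGE = """
<html><body>
  <div id="A_a_001_wd">23.5</div>
  <div id="menu">Home</div>
  <div id="A_a_002_wd"><span>24.1</span></div>
</body></html>
"""

captured = capture(PAGE, ["A_a_001_wd", "A_a_002_wd"])
```

Because the crawler looks only at elements whose id is a registered mark, the rest of the page layout can change without affecting extraction.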

According to the data extraction method based on the web crawler mechanism, marks are added to the monitoring data in the human-machine interface of each heterogeneous system, which gives the crawler program explicit targets and turns data acquisition into an active process. Automatic data extraction can then run on all heterogeneous systems around the clock, 24 hours a day, so that a large amount of data can be quickly extracted from the existing monitoring platforms and an enterprise-level big data environment can be quickly built.
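A predefined mark can serve as a target for the crawler only if it is composed consistently. The following sketch assumes, purely for illustration, that a mark such as "A_a_001_wd" consists of four underscore-separated fields (system, facility, serial number, measured quantity); the patent gives only the example string, so the field names and their meaning are assumptions.

```python
from typing import NamedTuple


class Mark(NamedTuple):
    system: str    # assumed: code of the heterogeneous system, e.g. "A"
    facility: str  # assumed: code of the facility, e.g. "a"
    serial: str    # assumed: serial number, e.g. "001"
    quantity: str  # assumed: code of the measured quantity, e.g. "wd"


def parse_mark(mark_id: str) -> Mark:
    """Split a mark into its assumed four fields, rejecting other shapes."""
    parts = mark_id.split("_")
    if len(parts) != 4 or not all(parts):
        raise ValueError(f"unexpected mark format: {mark_id!r}")
    return Mark(*parts)


def marked_div(mark: Mark, value: str) -> str:
    """Render a value as a <div> whose id is the mark."""
    mark_id = "_".join(mark)
    return f'<div id="{mark_id}">{value}</div>'


example = parse_mark("A_a_001_wd")
snippet = marked_div(example, "23.5")
```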

Further, in the data extraction method based on the web crawler mechanism, step one further comprises adding marks to the monitoring data in each document. Once the monitoring data in each document has been marked, it can be captured by the crawler algorithm and fused with the monitoring data in the human-machine interfaces of the heterogeneous systems.
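The fusion of the two sources could look like the sketch below. It assumes that both the human-machine interfaces and the documents have already been captured into mark-to-value mappings; how marks are embedded in a document is not specified in the text, so that part is not shown. The function name `fuse`, the source labels and the mark "B_b_001_wd" are hypothetical.

```python
def fuse(hmi_values, document_values):
    """Merge two mark -> value mappings, recording where each value came from."""
    fused = {}
    for source, values in (("hmi", hmi_values), ("document", document_values)):
        for mark, value in values.items():
            fused.setdefault(mark, []).append({"source": source, "value": value})
    return fused


# Hypothetical captured values from both kinds of source.
fused = fuse(
    {"A_a_001_wd": "23.5"},
    {"A_a_001_wd": "23.4", "B_b_001_wd": "19.0"},
)
```

Keeping the source next to each value is one possible design; it leaves the decision on how to resolve conflicting readings to a later processing stage.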

Further, in the data extraction method based on the web crawler mechanism, step one further comprises grading the monitoring data, and step two further comprises setting different capture cycles for monitoring data of different levels. Different monitoring data often need to be captured at different frequencies; grading the monitoring data and setting a different capture cycle for each level can therefore effectively improve the efficiency of data acquisition. For example, the grading may be based on the nature of the data or on the importance of each heterogeneous system.
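One way to realize level-dependent capture cycles is to keep a period per level and select, on each scheduler tick, the marks whose period has elapsed. The levels, the periods in seconds and all names below are assumptions made for this sketch, not values from the patent.

```python
# Hypothetical levels and capture periods in seconds.
CAPTURE_PERIOD_S = {1: 60, 2: 600, 3: 3600}


def due_marks(levels, last_captured, now):
    """Return the marks whose capture period has elapsed at time `now`.

    levels:        mark -> level
    last_captured: mark -> time of the last capture (missing if never captured)
    now:           current time, in the same unit as last_captured
    """
    due = []
    for mark, level in levels.items():
        last = last_captured.get(mark)
        if last is None or now - last >= CAPTURE_PERIOD_S[level]:
            due.append(mark)
    return sorted(due)


levels = {"A_a_001_wd": 1, "A_a_002_wd": 2, "B_b_001_wd": 3}
last_captured = {"A_a_001_wd": 0, "A_a_002_wd": 0}

# At t = 120 s only the level-1 mark and the never-captured mark are due.
due = due_marks(levels, last_captured, now=120)
```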

Further, in the data extraction method based on the web crawler mechanism, step one further comprises establishing a data schema definition for the monitoring data; and the method further comprises step three, converting the data type and display form of the captured monitoring data through a data conversion system to generate a standard data format file that conforms to the data schema definition. Because the data schema definition is established according to the enterprise data standard, it standardizes the data access of the connected information systems and provides a standardized data format reference for future information system development.
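Step three can be sketched as follows, assuming the captured values are display strings such as "23.5 degC" and the standard data format is XML. The element and attribute names (`monitoringData`, `item`, `mark`, `unit`, `timestamp`) are hypothetical; in practice they would follow the enterprise data schema definition, and validation against that schema is not shown here.

```python
import re
import xml.etree.ElementTree as ET


def normalize(raw):
    """Split a displayed reading such as '23.5 degC' into (float value, unit)."""
    match = re.fullmatch(r"\s*(-?\d+(?:\.\d+)?)\s*(\S*)\s*", raw)
    if match is None:
        raise ValueError(f"cannot parse reading: {raw!r}")
    return float(match.group(1)), match.group(2)


def to_xml(captured, timestamp):
    """Convert a mark -> display string mapping into an XML document string."""
    root = ET.Element("monitoringData", {"timestamp": timestamp})
    for mark, raw in sorted(captured.items()):
        value, unit = normalize(raw)
        item = ET.SubElement(root, "item", {"mark": mark, "unit": unit})
        item.text = f"{value:.2f}"  # uniform display form: two decimals
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml(
    {"A_a_001_wd": "23.5 degC", "A_a_002_wd": " 24 degC"},
    timestamp="2019-11-21T08:00:00",
)
```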

Further, the data extraction method based on the web crawler mechanism comprises step four, reading the standard data format file into a computer system and storing it in a database after program processing.
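Step four might be sketched as below, assuming the standard data format file is an XML document with `item` elements carrying `mark` and `unit` attributes (a hypothetical layout) and using SQLite as a stand-in for the database, which the patent does not name. The table and column names are likewise illustrative.

```python
import sqlite3
import xml.etree.ElementTree as ET


def store(xml_text, conn):
    """Parse the XML document and upsert its readings; return the row count."""
    root = ET.fromstring(xml_text)
    timestamp = root.get("timestamp")
    rows = [
        (item.get("mark"), timestamp, float(item.text), item.get("unit"))
        for item in root.iter("item")
    ]
    with conn:  # commit on success, roll back on error
        conn.execute(
            "CREATE TABLE IF NOT EXISTS readings ("
            "mark TEXT, captured_at TEXT, value REAL, unit TEXT, "
            "PRIMARY KEY (mark, captured_at))"
        )
        conn.executemany(
            "INSERT OR REPLACE INTO readings VALUES (?, ?, ?, ?)", rows
        )
    return len(rows)


# Hypothetical standard data format file content.
SAMPLE = (
    '<monitoringData timestamp="2019-11-21T08:00:00">'
    '<item mark="A_a_001_wd" unit="degC">23.50</item>'
    '<item mark="A_a_002_wd" unit="degC">24.00</item>'
    "</monitoringData>"
)

conn = sqlite3.connect(":memory:")
stored = store(SAMPLE, conn)
```

Using the mark and capture time as the primary key makes a repeated import of the same file idempotent.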

Further, in the data extraction method based on the web crawler mechanism, specifically, the data schema definition is an XML Schema Definition (XSD), and the standard data format file is an XML file.

Further, in the data extraction method based on the web crawler mechanism, specifically, the human-machine interface of the heterogeneous system is in HTML form, and the document is in Word, Excel or PDF form.

Drawings

Fig. 1 is a schematic flow chart of a data extraction method based on a web crawler mechanism according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and specific examples. Advantages and features of the present invention will become apparent from the following description and from the claims. It should be noted that the drawings are in a highly simplified form and not to precise scale, and serve only to illustrate the embodiments of the present invention conveniently and clearly.

Example 1:

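Referring to Fig. 1, the data extraction method based on a web crawler mechanism of this embodiment comprises the following steps:

Step one, adding marks to the monitoring data in the human-machine interface of each heterogeneous system.

Step two, capturing the monitoring data with a crawler algorithm according to the marks.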

According to the data extraction method based on the web crawler mechanism, marks are added to the monitoring data in the human-machine interface of each heterogeneous system, which gives the crawler program explicit targets and turns data acquisition into an active process. Automatic data extraction can then run on all heterogeneous systems around the clock, 24 hours a day, so that a large amount of data can be quickly extracted from the existing monitoring platforms and an enterprise-level big data environment can be quickly built.

As a preferred embodiment, in the data extraction method based on the web crawler mechanism, step one further comprises adding marks to the monitoring data in each document. Once the monitoring data in each document has been marked, it can be captured by the crawler algorithm and fused with the monitoring data in the human-machine interfaces of the heterogeneous systems.

As a preferred embodiment, in the data extraction method based on the web crawler mechanism, step one further comprises grading the monitoring data, and step two further comprises setting different capture cycles for monitoring data of different levels. Different monitoring data often need to be captured at different frequencies; grading the monitoring data and setting a different capture cycle for each level can therefore effectively improve the efficiency of data acquisition. For example, the grading may be based on the nature of the data or on the importance of each heterogeneous system.

As a preferred embodiment, in the data extraction method based on the web crawler mechanism, step one further comprises establishing a data schema definition for the monitoring data; and the method further comprises step three, converting the data type and display form of the captured monitoring data through a data conversion system to generate a standard data format file that conforms to the data schema definition. Because the data schema definition is established according to the enterprise data standard, it standardizes the data access of the connected information systems and provides a standardized data format reference for future information system development.

As a preferred embodiment, the data extraction method based on the web crawler mechanism further comprises step four, reading the standard data format file into a computer system and storing it in a database after program processing.

As a preferred embodiment, in the data extraction method based on the web crawler mechanism, specifically, the data schema definition is an XML Schema Definition (XSD), and the standard data format file is an XML file.

As a preferred embodiment, in the data extraction method based on the web crawler mechanism, specifically, the human-machine interface of the heterogeneous system is in HTML form, and the document is in Word, Excel or PDF form.

The above describes only the preferred embodiments of the present invention and is not intended to limit its scope; any variations and modifications made by those skilled in the art based on the above disclosure fall within the scope of the appended claims.

Claims

1. A data extraction method based on a web crawler mechanism, characterized by comprising the following steps:

step one, adding marks to the monitoring data in the human-machine interface of each heterogeneous system;

step two, capturing the monitoring data with a crawler algorithm according to the marks.

2. The data extraction method based on a web crawler mechanism according to claim 1, wherein step one further comprises adding marks to the monitoring data in each document.

3. The data extraction method based on a web crawler mechanism according to claim 1, wherein step one further comprises grading the monitoring data, and step two further comprises setting different capture cycles for monitoring data of different levels.

4. The data extraction method based on a web crawler mechanism according to claim 1, wherein step one further comprises establishing a data schema definition for the monitoring data; and the method further comprises step three, converting the data type and display form of the captured monitoring data through a data conversion system to generate a standard data format file that conforms to the data schema definition.

5. The data extraction method based on a web crawler mechanism according to claim 4, further comprising step four, reading the standard data format file into a computer system and storing it in a database after program processing.

6. The data extraction method based on a web crawler mechanism according to claim 4, wherein the data schema definition is an XML Schema Definition (XSD), and the standard data format file is an XML file.

7. The data extraction method based on a web crawler mechanism according to claim 2, wherein the human-machine interface of the heterogeneous system is in HTML form, and the document is in Word, Excel or PDF form.