CN113297448A

CN113297448A - Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium

Info

Publication number: CN113297448A
Application number: CN202110522150.9A
Authority: CN
Inventors: 王洪明; 李静静; 孙树计; 刘书志; 王飞飞; 胡冉冉; 刘晓雷
Original assignee: China Institute of Radio Wave Propagation CETC 22 Research Institute
Current assignee: China Institute of Radio Wave Propagation CETC 22 Research Institute
Priority date: 2021-05-13
Filing date: 2021-05-13
Publication date: 2021-08-24
Anticipated expiration: 2041-05-13
Also published as: CN113297448B

Abstract

The invention discloses a web crawler-based open-source electric wave environment data acquisition method and a computer-readable storage medium. The method comprises the following steps: step 1, open-source data investigation and webpage characteristic analysis; step 2, web crawler design; and step 3, automatic acquisition system architecture and integrated application. The disclosed method investigates and screens various data sources related to the electric wave environment, analyzes in depth information such as data type, acquisition means, acquisition time and update frequency, and includes data sources that are highly relevant to the electric wave environment and highly reliable among the targets of electric wave environment web crawler data acquisition.

Description

Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium

Technical Field

The invention belongs to the technical field of military information calculation and processing, and particularly relates to a web crawler-based open-source electric wave environment data acquisition method and a computer-readable storage medium.

Background

With the growing sharing of global data resources, the Internet has become an important source of information. The United States NOAA website, the Ionospheric Prediction Service (IPS) website in Australia, and the public service websites of the domestic meteorological department (cmax, t7online), among others, all offer massive, multi-source, heterogeneous electric wave environment support data. However, the labor cost of writing a custom data acquisition program for each website is high. There is therefore an urgent need to design and implement an open-source electric wave environment data acquisition method based on a web crawler. Such a method uses a web crawler tool to integrate a plurality of electric wave environment data sources, analyzes the website structure and webpage characteristics of each data source, and configures the corresponding data acquisition rules in a user-defined manner, so that massive open-source electric wave environment data is acquired automatically. This provides a convenient, low-cost means of acquiring open-source electric wave environment data resources, meets the data support requirements of later electric wave environment event detection and situation analysis, and lays a foundation for improving electric wave environment information support capability in the big data era.

At present, network data acquisition is mainly accomplished with web spiders or data acquisition robots that make comprehensive use of vertical search engine technology. Abroad, Madhusudan et al. proposed a focused semantic web crawler method to address the lack of search engine indexes for deep web pages. Hyo-Jung et al. designed and implemented a crawler algorithm that automatically collects dynamically generated web pages from the deep web, using scripts rather than keywords as links. Kumar et al. proposed a query-based crawler method that submits a set of keywords related to the user's topic of interest to the search query interface of the website corresponding to a URL, thereby obtaining the most relevant links. Many domestic scholars have also studied data acquisition. In the "Jingdong City public opinion analysis system", Liupeng of the University of Electronic Science and Technology designed a data acquisition method that is oriented to multiple data sources, allows detailed configuration of the acquisition source, acquisition depth and acquisition category, and supports control of acquisition frequency and parallel multithreaded scheduling. The compass uses the script multithreading crawler framework to simulate login, capture dynamic web pages, and overcome the anti-crawler mechanism of a microblog site.

In addition to the research described above, many domestic enterprises are engaged in mass data acquisition. Most of them use vertical search engine technology, and some combine multiple technologies. The "train collector" combines mass data collection with post-processing by adopting vertical search engine, network radar, information tracking, automatic sorting and automatic indexing technologies. The "octopus collector" of Shenzhen View information technology Limited company can acquire a large amount of normalized data from different websites or webpages in a short time, helping customers automate data acquisition, editing and normalization and reducing their dependence on manual search and data collection.

In research on electric wave environment ontology construction technology, the main events of the electric wave environment and the related factors that affect weaponry were sorted out according to the principles by which the electric wave environment acts on and influences weaponry systems. A battlefield electric wave environment support ontology model was proposed on the basis of ontology engineering and built with an ontology modeling tool, providing a research basis for automatic acquisition of electric wave environment data based on an ontology knowledge base. Dozens of open-source electric wave environment websites have been collected, including the U.S. Space Weather Prediction Center (http://www.swpc.noaa.gov), the U.S. National Geophysical Data Center (http://spidr.ngdc.gov), the Jet Propulsion Laboratory (http://iono.jpl.nasa.gov), the space weather and ionospheric scintillation service (http://www.nwra-az.com), Weather Online (http://t7online.com), and the University of Wyoming weather service website (http://weather.uwyo.edu). Data from some of these websites has been downloaded with custom data collection programs. Experience with HTML page layout shows that, under the W3C HTML DOM standard, all content in an HTML document can be represented as nodes in a DOM tree, so a page can be parsed with CSS selectors or XPath to locate the relevant nodes and extract the desired data. These results lay a technical foundation for research on the web crawler-based open-source electric wave environment data acquisition method.
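The DOM-based extraction mentioned above can be illustrated with a minimal sketch. It assumes a small, well-formed page fragment and uses the limited XPath subset of Python's standard `xml.etree.ElementTree`; the fragment, the class names `station` and `value`, and the function name `extract_records` are hypothetical and are not taken from any of the websites listed.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (an assumption for illustration).
# Real pages are often not well-formed and would need an HTML-tolerant parser.
PAGE = """
<html><body>
  <div id="obs1">
    <p class="station">Station A</p>
    <p class="value">12.5</p>
  </div>
  <div id="obs2">
    <p class="station">Station B</p>
    <p class="value">9.8</p>
  </div>
</body></html>
"""

def extract_records(page):
    """Walk the DOM tree and pull one record out of every <div> node."""
    root = ET.fromstring(page)
    records = []
    for div in root.findall(".//div"):              # every <div> in the tree
        station = div.find("p[@class='station']")   # child node selected by attribute
        value = div.find("p[@class='value']")
        if station is not None and value is not None:
            records.append({"station": station.text, "value": value.text})
    return records

if __name__ == "__main__":
    for record in extract_records(PAGE):
        print(record)
```

Each record corresponds to one subtree of the DOM, which is the same idea a CSS selector or a full XPath engine would express more compactly.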

Disclosure of Invention

The invention aims to provide a web crawler-based open-source electric wave environment data acquisition method and a computer-readable storage medium.

The invention adopts the following technical scheme:

An improvement of an open-source electric wave environment data acquisition method based on a web crawler is that the method comprises the following steps:

step 1, open source data investigation and webpage characteristic analysis:

step 11, carrying out investigation and analysis of open source electric wave environment related data, carrying out investigation and screening on a plurality of electric wave environment related data sources, preferably selecting a data source which can be used for electric wave environment situation analysis and has high correlation degree with the electric wave environment situation analysis, deeply analyzing the data type, the acquisition means, the acquisition time and the update frequency, carrying out quality analysis of an electric wave environment information source by means of an internet evaluation tool, and bringing the data source with high correlation degree and high reliability with the electric wave environment into an object for electric wave environment network crawler data acquisition;

step 12, studying the distribution characteristics and change rules of open-source electric wave environment webpages: analyzing the collected open-source electric wave environment websites in turn, with statistical analysis focused on the structure of the data to be collected, whether login and verification codes are required, and whether AJAX is used; completing the classification and grading of the open-source electric wave environment websites; and formulating data crawling strategies for websites of different grades;

step 2, designing a web crawler:

step 21, customizing an acquisition rule, classifying the electric wave environment data to be acquired according to different layouts on a webpage, and configuring the acquisition rule for different design style website data of single data, list data, form data and page-turning multi-page data one by one to form an acquisition task;

step 22, designing data acquisition rules for special websites: for a website that requires login and a verification code, configuring the rule in browser mode by remembering the Cookie, or by entering the account, entering the password and clicking login, to form an acquisition task; for a website that uses AJAX, configuring the rule with an AJAX timeout to form an acquisition task;

step 23, extracting and warehousing data: data extraction is performed with optimized acquisition rules, the optimizations including adding special fields, adjusting field positions, merging fields, formatting fields and applying regular expressions, to produce data that meets the final requirements; the data is periodically imported into a database through data field mapping;

step 3, automatic acquisition system architecture and integrated application:

step 31, designing a distributed cloud-architecture acquisition system: the whole system is deployed on a cloud platform and comprises a main program, a monitoring program, an acquisition rule configuration client, a distributed acquisition cluster and a storage cluster; the main program hosts services including client connection, cloud nodes, cloud-collected data and account information storage; the monitoring program provides service resource management, node resource management, and task control and monitoring; the acquisition rule configuration client builds the acquisition rule flow visually by simulating manual webpage browsing, and the acquisition rule configuration data is stored in a configuration database;

step 32, integrated application in the electric wave environment application system: an electric wave environment application system deployed directly on the Internet is integrated seamlessly with the electric wave environment web crawler acquisition system, with sampling frequencies that can be set in minutes, hours, days, weeks or months; for an electric wave environment application system deployed across networks, the electric wave environment web crawler acquisition system supports export in multiple data formats including Excel, SQL, TXT and MySQL, so that only periodic data import or a data integration interface program is needed.

Further, in step 21, the step of collecting the single data is:

(a) creating a new task and entering the website address; selecting each field to be collected, choosing to collect the element's text for text fields and the image address for picture fields;

(b) editing the fields: modifying field names and performing further operations on fields, including deletion, copying and formatting;

(c) collecting and exporting the data: running the configured acquisition task and, after collection finishes, exporting the data in a suitable format; Excel, CSV and HTML are supported.

Further, in step 21, the step of collecting the list data is:

(a) selecting a data list on the page, the selected range being as large as possible so that it contains all fields to be collected;

(b) choosing "select sub-elements" for the data list; the interface indicates the selected sub-elements and reports that similar elements have been found; then choosing "select all";

(c) after confirming that all fields are selected, starting data collection, which completes the "loop-extract data" setting.

Further, in step 21, the step of collecting the table data is:

(a) selecting the first cell of the first row of the table on the page, then choosing to expand the selection in the prompt box until the whole row is selected, whereupon the individual fields of the row are located and identified automatically;

(b) choosing "select sub-elements"; when the interface reports that similar elements have been found, choosing "select all" so that all table sub-elements on the page are selected;

(c) by default the current rule collects all fields in the table; fields are modified or deleted in the field editing interface as needed, and data collection is started.

Further, in step 21, the step of collecting page-turning multi-page data includes:

(a) first, establishing a "loop-extract data" process to collect the list data on the first page;

(b) finding and clicking "next page" on the page, then choosing "loop click next page" in the prompt box to establish the page-turning loop;

(c) starting data collection, which completes the collection rule for page-turning multi-page data.
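The page-turning loop in steps (a) to (c) can be sketched as follows. This is only an illustration under stated assumptions: the site is an in-memory stand-in, and the names `FAKE_SITE`, `fetch_page` and `collect_all_pages` are hypothetical; a real rule would fetch and parse live pages.

```python
# Hypothetical in-memory "site": each URL maps to (list items, next-page URL or None).
FAKE_SITE = {
    "/list?page=1": (["r1", "r2"], "/list?page=2"),
    "/list?page=2": (["r3", "r4"], "/list?page=3"),
    "/list?page=3": (["r5"], None),
}

def fetch_page(url):
    """Stand-in for downloading a page and extracting its list items."""
    return FAKE_SITE[url]

def collect_all_pages(start_url, fetch=fetch_page, max_pages=100):
    """Extract the list on each page, then follow 'next page' until none remains."""
    items = []
    visited = set()
    url = start_url
    while url is not None and url not in visited and len(visited) < max_pages:
        visited.add(url)
        page_items, next_url = fetch(url)
        items.extend(page_items)   # per-page list extraction, as in step (a)
        url = next_url             # clicking "next page", as in step (b)
    return items

if __name__ == "__main__":
    print(collect_all_pages("/list?page=1"))
```

The `visited` set and the `max_pages` cap are defensive choices of this sketch, guarding against a "next page" link that cycles back to an earlier page.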

In a computer-readable storage medium having stored thereon a radio wave environment web crawler collection system program, the improvement comprising: when being executed by a processor, the electric wave environment web crawler acquisition system program realizes the steps of the open-source electric wave environment data acquisition method based on the web crawler.

The invention has the beneficial effects that:

The disclosed method investigates and screens various data sources related to the electric wave environment, analyzes in depth information such as data type, acquisition means, acquisition time and update frequency, and includes data sources that are highly relevant to the electric wave environment and highly reliable among the targets of electric wave environment web crawler data acquisition. The open-source websites are analyzed in turn, classified and graded, and data crawling strategies are formulated for websites of different grades. According to how open-source website data is laid out on the webpage, different acquisition rules are designed for single data, list data, table data, page-turning multi-page data and the like, while special rules handle the more difficult data sources, such as those that require login and verification or use AJAX. On this basis, a distributed cloud-architecture acquisition system for the electric wave environment web crawler is built, which performs webpage analysis, data acquisition, data cleaning and data storage for the fragmented information of open-source electric wave environment websites in one integrated flow. Fully automatic and semi-automatic docking schemes are designed for electric wave environment application systems deployed in the same network and across networks respectively, achieving unattended operation and efficient operation and maintenance of the application systems and laying a foundation for improving electric wave environment information support capability in the big data era.

The disclosed method uses the data acquired by the electric wave environment web crawler acquisition system as an effective supplement to the observation data of the national defense science and technology industry radio wave observation station network. It extends environmental support beyond the coverage of the station network and addresses the station network's relatively weak capability to observe, forecast and support the electric wave environment of military hot-spot areas around China. The resulting data products can serve users in fields such as communication/broadcasting, detection, navigation and positioning, and remote sensing, providing data support and application assurance for the planning, demonstration, design, development and service application of related electronic information systems.

Drawings

FIG. 1 is a schematic flow chart of the method disclosed in example 1 of the present invention;

FIG. 2 is a technical route diagram of the method disclosed in example 1 of the present invention;

FIG. 3 is a schematic diagram of a single data collection rule and a collection field configuration in the method disclosed in embodiment 1 of the present invention;

FIG. 4 is a schematic diagram of a configuration of a list data collection rule and a collection field in the method disclosed in embodiment 1 of the present invention;

FIG. 5 is a schematic diagram of a cyclic collection rule configuration of the form data in the method disclosed in embodiment 1 of the present invention;

FIG. 6 is a schematic diagram of a configuration of a table data collection field in the method disclosed in embodiment 1 of the present invention;

FIG. 7 is a diagram illustrating a configuration of page-turning multi-page data collection rules in the method disclosed in embodiment 1 of the present invention;

FIG. 8 is a schematic diagram of data being collected by the method disclosed in example 1 of the present invention;

FIG. 9 is a schematic diagram of the AJAX load data configuration in the method disclosed in embodiment 1 of the present invention;

FIG. 10 is a schematic diagram of a data export configuration in the method disclosed in embodiment 1 of the present invention;

FIG. 11 is a schematic system architecture diagram of an integrated application of the electric wave environment web crawler acquisition system and the electric wave environment application system in the method disclosed in embodiment 1 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

Embodiment 1: this embodiment discloses a web crawler-based open-source electric wave environment data acquisition method which, as shown in FIG. 1 and FIG. 2, comprises the following steps:

step 1, open source data investigation and webpage characteristic analysis:

step 11, carrying out investigation and analysis of open source electric wave environment related data, carrying out investigation and screening on a plurality of electric wave environment related data sources, preferably selecting a data source which can be used for electric wave environment situation analysis and has high correlation degree with the electric wave environment situation analysis, deeply analyzing information such as data type, acquisition means, acquisition time, updating frequency and the like, carrying out quality analysis on the electric wave environment information source by means of an internet evaluation tool, and bringing the data source with high correlation degree and high reliability with the electric wave environment into an object for electric wave environment network crawler data acquisition;

step 12, studying the distribution characteristics and change rules of open-source electric wave environment webpages: analyzing the dozens of collected open-source electric wave environment websites in turn, with statistical analysis focused on aspects that affect the design and implementation of a web crawler, such as the structure of the data to be collected, whether login and verification codes are required, and whether AJAX is used; completing the classification and grading of the open-source electric wave environment websites; and formulating data crawling strategies for websites of different grades;
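The classification and grading in step 12 can be pictured with a small data structure. The sketch below is an assumption for illustration only: the text does not define the grades, so the three-level scale, the strategy descriptions and the names `SiteProfile`, `grade` and `crawl_strategy` are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SiteProfile:
    """Characteristics recorded for one surveyed website (hypothetical fields)."""
    name: str
    layout: str          # "single", "list", "table" or "paged"
    needs_login: bool
    needs_captcha: bool
    uses_ajax: bool

def grade(site):
    """Assumed difficulty scale: 1 = plain pages, 2 = login or AJAX, 3 = verification code."""
    if site.needs_captcha:
        return 3
    if site.needs_login or site.uses_ajax:
        return 2
    return 1

# Assumed mapping from grade to a crawling strategy description.
STRATEGY = {
    1: "direct rule-based collection",
    2: "browser mode with cookie or login, or AJAX timeout setting",
    3: "browser mode plus verification code recognition",
}

def crawl_strategy(site):
    return STRATEGY[grade(site)]

if __name__ == "__main__":
    example = SiteProfile("example-site", "table", False, False, True)
    print(example.name, grade(example), crawl_strategy(example))
```

Keeping the survey result as structured data lets the same table drive which acquisition rule template is applied to each website.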

step 2, designing a web crawler:

step 21, customizing an acquisition rule, classifying the electric wave environment data to be acquired according to different layouts on a webpage, and configuring the acquisition rule for website data with different design styles, such as single data, list data, form data, page-turning multi-page data and the like one by one to form an acquisition task;

As shown in FIG. 3, single data is collected. Taking the data at http://stereo-ssc.nascom.nasa.gov/beacon/beacon_insitu.shtml as an example, the webpage contains many fields: text (IMPACT/PLASTIC solar wind data, IMPACT solar electronic data, SWAVES radio data), pictures (the data corresponding to each picture), time, and so on. These fields are collected and stored as structured data. The main operation flow follows steps (a) to (c) for collecting single data described above, where step (b), editing the fields, covers modifying field names and further field operations such as deletion, copying and formatting.

As shown in FIG. 4, list data is collected. Taking the data at https://ui.addabs.harvard.edu/abs/2004A%26a..425.1097R/coordinates as an example, the page contains many structurally identical list entries, each with the same fields: serial number, article number, time, article title, author, and so on. The fields of each entry are collected in the order in which the entries appear on the webpage; the key task is to establish a "loop-extract data" process.
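A loop over structurally identical list entries can be sketched with Python's standard `html.parser`. The markup, the class names (`item`, `no`, `title`, `author`) and the sample values are hypothetical and do not reproduce the page cited above; the point is only that one record is emitted per repeated element.

```python
from html.parser import HTMLParser

# Hypothetical list markup with made-up values (an assumption for illustration).
LIST_HTML = """
<ul>
  <li class="item"><span class="no">1</span><span class="title">Paper A</span><span class="author">Li</span></li>
  <li class="item"><span class="no">2</span><span class="title">Paper B</span><span class="author">Wang</span></li>
</ul>
"""

class ListExtractor(HTMLParser):
    """Collect one record per <li class="item">, keyed by the class of each <span>."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None   # record being filled, or None outside a list entry
        self._field = None     # field name of the <span> currently open

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "li" and attrs.get("class") == "item":
            self._current = {}
        elif tag == "span" and self._current is not None:
            self._field = attrs.get("class")

    def handle_data(self, data):
        if self._current is not None and self._field:
            previous = self._current.get(self._field, "")
            self._current[self._field] = previous + data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None
        elif tag == "li" and self._current is not None:
            self.records.append(self._current)   # one loop iteration finished
            self._current = None

def extract_list(html):
    parser = ListExtractor()
    parser.feed(html)
    parser.close()
    return parser.records

if __name__ == "__main__":
    for record in extract_list(LIST_HTML):
        print(record)
```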

As shown in FIG. 5 and FIG. 6, table data is collected. Taking the data at https://cdaw.gsfc.nasa.gov/CME_list/UNIVERSAL/1996_01/univ1996_01.html as an example, the webpage contains a table with a very regular structure: each record occupies one row, and each row contains a number of fields: First C2 Appearance Date Time, Central PA, Angular Width, Linear Speed, 2nd-order Speed at final height, 2nd-order Speed at 20 Rs, Accel, Mass, Kinetic Energy, MPA, and so on. These fields are collected and stored in a format such as Excel; here too, the key task is to establish a "loop-extract data" process.
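The row-by-row table collection described above can be sketched as follows, using only the Python standard library. The table fragment is hypothetical: three of the column names come from the text, but the numeric values are made up for illustration, and CSV is used here as a simple stand-in for an Excel export. The names `TableExtractor` and `table_to_csv` are likewise assumptions.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical table fragment; the numeric values are invented for illustration.
TABLE_HTML = """
<table>
  <tr><th>Central PA</th><th>Angular Width</th><th>Linear Speed</th></tr>
  <tr><td>100</td><td>30</td><td>250</td></tr>
  <tr><td>200</td><td>45</td><td>400</td></tr>
</table>
"""

class TableExtractor(HTMLParser):
    """Collect every table row as a list of cell strings."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None    # cells of the row currently open
        self._cell = None   # text chunks of the cell currently open

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)   # one row = one record
            self._row = None

def table_to_csv(html):
    """Parse the table and serialise its rows as CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()

if __name__ == "__main__":
    print(table_to_csv(TABLE_HTML), end="")
```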

As shown in FIG. 7, page-turning multi-page data is collected. Again taking the data at https://ui.addabs.harvard.edu/abs/2004A%26a..425.1097R/locations as an example, a single page of the website is list data, but the pages must be turned one after another to collect all of the data. The main operation flow follows steps (a) to (c) for collecting page-turning multi-page data described above.

step 22, designing data acquisition rules for special websites. For a website that requires login and a verification code, the rule is configured in browser mode by remembering the Cookie, or by entering the account, entering the password and clicking login; a few websites additionally require a verification code, for which a verification code recognition control is used to enter the code automatically. For a website that uses AJAX, an AJAX setting is required: the AJAX timeout is set, and acquisition rules are then configured one by one for website data of different design styles, such as single data, list data, table data and page-turning multi-page data, to form acquisition tasks. FIG. 8 is a schematic diagram of data being collected; FIG. 9 is a schematic diagram of the AJAX load data configuration;
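The role of an AJAX timeout can be illustrated with a polling sketch: the collector keeps checking whether the asynchronously loaded data has appeared and gives up once the configured timeout elapses. This is an assumption about how such a timeout might behave, not a description of the tool used in the embodiment; `wait_for_content` and `FakeAjaxPage` are hypothetical names, and the page is simulated in memory.

```python
import time

def wait_for_content(probe, timeout_s=5.0, poll_s=0.05):
    """Poll probe() until it returns a non-None value or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = probe()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"content not loaded within {timeout_s} s")
        time.sleep(poll_s)

class FakeAjaxPage:
    """Hypothetical page whose data appears only after a few polls."""

    def __init__(self, ready_after):
        self.polls = 0
        self.ready_after = ready_after

    def probe(self):
        self.polls += 1
        if self.polls >= self.ready_after:
            return ["row1", "row2"]
        return None

if __name__ == "__main__":
    page = FakeAjaxPage(ready_after=3)
    print(wait_for_content(page.probe, timeout_s=1.0, poll_s=0.01))
```

A timeout that is too short loses data on slow pages, while one that is too long slows every page, which is why it is set per website.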

step 23, extracting and warehousing data: while the acquisition rules are configured they are also optimized, including by adding special fields, adjusting field positions, merging fields, formatting fields and applying regular expressions, to produce data that meets the final requirements; the data can be periodically imported into a database through data field mapping.
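The field operations and field mapping of step 23 can be sketched with the standard `re` and `sqlite3` modules. Everything specific here is an assumption for illustration: the field names, the `FIELD_MAP` mapping, the table name `cme`, the added `source` field, and the use of SQLite in place of whatever database the system actually targets.

```python
import re
import sqlite3

# Hypothetical mapping from collected field names to database columns.
FIELD_MAP = {"Date": "obs_date", "Time": "obs_time", "Linear Speed": "speed_km_s"}

def clean_record(raw):
    """Format fields: trim whitespace, normalise the date, keep the leading number."""
    rec = {key: value.strip() for key, value in raw.items()}
    rec["Date"] = rec["Date"].replace("/", "-")
    match = re.match(r"(-?\d+(?:\.\d+)?)", rec["Linear Speed"])   # regular expression step
    rec["Linear Speed"] = float(match.group(1)) if match else None
    return rec

def to_row(rec):
    """Map fields to columns, merge two fields, and add a special field."""
    row = {column: rec[field] for field, column in FIELD_MAP.items()}
    row["obs_datetime"] = f"{row['obs_date']}T{row['obs_time']}"   # merged field
    row["source"] = "hypothetical-source"                          # added special field
    return row

def store(conn, rows):
    """Import mapped rows into the database."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cme ("
        "obs_date TEXT, obs_time TEXT, speed_km_s REAL, obs_datetime TEXT, source TEXT)"
    )
    conn.executemany(
        "INSERT INTO cme VALUES "
        "(:obs_date, :obs_time, :speed_km_s, :obs_datetime, :source)",
        rows,
    )
    conn.commit()

if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    # Made-up sample record, for illustration only.
    sample = {"Date": " 2000/01/02 ", "Time": "03:04:05", "Linear Speed": "123 km/s"}
    store(connection, [to_row(clean_record(sample))])
    print(connection.execute("SELECT * FROM cme").fetchall())
```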

Step 3, automatic acquisition system architecture and integrated application:

step 31, designing a distributed cloud-architecture acquisition system: the whole system is deployed on a cloud platform and comprises a main program, a monitoring program, an acquisition rule configuration client, a distributed acquisition cluster and a storage cluster; the main program hosts multiple services such as client connection, cloud nodes, cloud-collected data and account information storage; the monitoring program provides functions such as service resource management, node resource management, and task control and monitoring; the acquisition rule configuration client builds the acquisition rule flow visually by simulating manual webpage browsing, and the acquisition rule configuration data is stored in a configuration database.
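The division of work between the main program, the acquisition cluster and the monitoring program can be pictured with a single-process sketch. This is a loose analogy under stated assumptions: worker threads stand in for cloud nodes, and `run_task` and `dispatch` are hypothetical names; a real deployment would distribute tasks across machines.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(task):
    """Stand-in for one acquisition node executing one configured rule."""
    return {"task": task["name"], "status": "done", "records": len(task["urls"])}

def dispatch(tasks, max_nodes=4):
    """Hand tasks to a pool of workers and gather a status report per task."""
    with ThreadPoolExecutor(max_workers=max_nodes) as pool:
        results = list(pool.map(run_task, tasks))
    # The report plays the role of what a monitoring program would display.
    return {result["task"]: result for result in results}

if __name__ == "__main__":
    demo_tasks = [
        {"name": "site-a", "urls": ["u1", "u2"]},   # hypothetical task definitions
        {"name": "site-b", "urls": ["u3"]},
    ]
    for name, status in dispatch(demo_tasks).items():
        print(name, status)
```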

step 32, integrated application in the electric wave environment application system: fully automatic and semi-automatic docking with the electric wave environment web crawler acquisition system is designed for electric wave environment application systems deployed in the same network and across networks, respectively. Specifically, an application system deployed directly on the Internet is integrated seamlessly with the acquisition system; sampling frequencies such as minutes, hours, days, weeks and months are set, data is collected and warehoused automatically, and the application system runs unattended. For an application system deployed across networks, the acquisition system supports export in multiple data formats such as Excel, SQL, TXT and MySQL, and the user only needs to import data periodically, or write a simple data integration interface program, to operate and maintain the application system efficiently. FIG. 10 is a schematic diagram of the data export configuration; FIG. 11 is a schematic diagram of the system architecture for the integrated application of the electric wave environment web crawler acquisition system and the electric wave environment application system.
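The sampling frequencies named in step 32 can be turned into run times with a short scheduling sketch. The function name `next_run` and the rule for months (same day of the next month, clamped to that month's length) are assumptions of this sketch; the text does not specify how monthly sampling is aligned.

```python
import calendar
from datetime import datetime, timedelta

# Fixed-length sampling intervals.
FIXED = {
    "minute": timedelta(minutes=1),
    "hour": timedelta(hours=1),
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
}

def next_run(last, frequency):
    """Return the time of the next collection run after `last`."""
    if frequency in FIXED:
        return last + FIXED[frequency]
    if frequency == "month":
        year = last.year + last.month // 12     # roll over after December
        month = last.month % 12 + 1
        # Assumed rule: clamp the day to the length of the target month.
        day = min(last.day, calendar.monthrange(year, month)[1])
        return last.replace(year=year, month=month, day=day)
    raise ValueError(f"unknown sampling frequency: {frequency!r}")

if __name__ == "__main__":
    start = datetime(2021, 5, 13, 8, 0)
    for freq in ("minute", "hour", "day", "week", "month"):
        print(freq, next_run(start, freq))
```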

The embodiment also discloses a computer readable storage medium, on which a radio wave environment web crawler collection system program is stored, and when the radio wave environment web crawler collection system program is executed by a processor, the steps of the open-source radio wave environment data collection method based on the web crawler are implemented.

Claims

1. A method for collecting open-source electric wave environment data based on web crawlers is characterized by comprising the following steps:

step 1, open source data investigation and webpage characteristic analysis:

step 2, designing a web crawler:

step 3, automatic acquisition system architecture and integrated application:

2. The open-source electric wave environment data acquisition method based on the web crawler according to claim 1, wherein in step 21, the step of acquiring single data comprises:

3. The open-source radiowave environment data collection method based on web crawlers according to claim 1, characterized in that in step 21, the step of collecting list data is:

4. The open-source radiowave environment data collection method based on web crawler according to claim 1, wherein in step 21, the step of collecting table data is:

5. The open-source radiowave environment data collection method based on the web crawler according to claim 1, wherein in step 21, the step of collecting page-turning multi-page data comprises:

6. A computer-readable storage medium on which a radio wave environment web crawler collection system program is stored, characterized in that: the electric wave environment web crawler collecting system program realizes the steps of the open source electric wave environment data collecting method based on the web crawler in any one of claims 1 to 5 when being executed by a processor.