CN113297448A - Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium


Info

Publication number
CN113297448A
Authority
CN
China
Prior art keywords
data
acquisition
wave environment
electric wave
selecting
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110522150.9A
Other languages
Chinese (zh)
Other versions
CN113297448B (en)
Inventor
王洪明
李静静
孙树计
刘书志
王飞飞
胡冉冉
刘晓雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Institute of Radio Wave Propagation CETC 22 Research Institute
Original Assignee
China Institute of Radio Wave Propagation CETC 22 Research Institute
Application filed by China Institute of Radio Wave Propagation CETC 22 Research Institute
Priority to CN202110522150.9A
Publication of CN113297448A
Application granted
Publication of CN113297448B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/906 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a web crawler-based open-source electric wave environment data acquisition method and a computer-readable storage medium. The method comprises the following steps: step 1, open-source data investigation and web page feature analysis; step 2, web crawler design; and step 3, automatic acquisition system architecture and integrated application. The disclosed method investigates and screens a variety of data sources related to the electric wave environment, analyzes in depth information such as data type, acquisition means, acquisition time and update frequency, and includes data sources that are highly relevant to the electric wave environment and highly reliable among the targets of electric wave environment web crawler data acquisition.

Description

Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium
Technical Field
The invention belongs to the technical field of military information computation and processing, and in particular relates to a web crawler-based open-source electric wave environment data acquisition method and a computer-readable storage medium in this field.
Background
As global data resources are increasingly shared, the internet has become an important source of information. The NOAA website in the United States, the Ionosphere Prediction Service (IPS) website in Australia, the public service websites of domestic meteorological departments (cmax, t7online) and others all offer massive multi-source heterogeneous data that can support electric wave environment work. However, customizing a data acquisition program for each website carries a high labor cost. There is therefore an urgent need to design and implement a web crawler-based open-source electric wave environment data acquisition method that uses a web crawler tool to integrate multiple electric wave environment data sources, analyzes the website structure and web page feature information of each data source, and configures the corresponding data acquisition rules in a customized way, thereby automating the acquisition of massive open-source electric wave environment data. Such a method provides a convenient, low-cost means of acquiring open-source electric wave environment data resources, meets the data support needs of subsequent electric wave environment event detection and situation analysis, and lays a foundation for improving electric wave environment information support capability in the big data era.
At present, network data acquisition is mainly accomplished with technologies such as web spiders or data acquisition robots that make combined use of vertical search engine technology. Abroad, Madhusudan et al. proposed a focused semantic web crawler method to address search engines' lack of indexing for deep web pages. Hyo-Jung et al. designed and implemented a crawler algorithm that automatically collects dynamically generated web pages from the deep web, using scripts rather than keywords as links. Kumar et al. proposed a query-based crawler method that submits a set of keywords related to the user's topic of interest to the search query interface of the website at a given URL, thereby obtaining the most relevant links. Many domestic scholars have also studied data acquisition. Liupeng of the university of electronic science and technology, in a 'Jingdong City public opinion analysis system', designed a data acquisition method that is oriented to multiple data sources, allows detailed configuration of the acquisition source, acquisition depth and acquisition category, and supports control by acquisition frequency and parallel multithreaded scheduling. The compass uses the script multithreading crawler frame to implement simulated login, dynamic web page capture, and circumvention of the microblog anti-crawler mechanism.
In addition to the above research by scholars at home and abroad, many domestic enterprises are engaged in mass data acquisition. Most of them use vertical search engine technology for data acquisition, and some combine multiple technologies. The train collector combines mass data collection with post-processing by adopting vertical search engine, network radar, information tracking, automatic sorting and automatic indexing technologies. The "octopus collector" of Shenzhen View information technology Limited company can acquire a large amount of normalized data from different websites or web pages in a short time, helping customers automate data acquisition, editing and normalization and reducing their dependence on manual search and data collection.
In research on electric wave environment ontology construction, the main electric wave environment events and the related factors affecting weaponry were sorted out according to the principles by which the electric wave environment acts on and influences weaponry systems. A battlefield electric wave environment support ontology model was proposed based on ontology engineering knowledge and constructed using an ontology modeling tool, providing a research basis for the automatic acquisition of electric wave environment data based on an ontology knowledge base. Dozens of open-source electric wave environment websites have been collected, including the US Space Weather Prediction Center (http://www.swpc.noaa.gov), the US National Geophysical Data Center (http://spidr.ngdc.gov), the Jet Propulsion Laboratory (http://iono.jpl.nasa.gov), a space weather and ionospheric scintillation service (http://www.nwra-az.com), Weather Online (http://t7online.com), and the University of Wyoming weather service website (http://weather.uwyo.edu). Data from some of these websites has already been downloaded by developing custom data collection programs. This work also built familiarity with HTML page layout: under the W3C HTML DOM standard, all content in an HTML document can be represented as nodes in a DOM tree, and a CSS selector or XPath expression can be used to parse a page, locate the relevant nodes, and extract the desired data. These results lay a technical foundation for research on a web crawler-based open-source electric wave environment data acquisition method.
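The node-based extraction mentioned above can be illustrated with a minimal Python sketch. It is not taken from the patent: the page fragment, the element ids and the function name `extract_text` are hypothetical, and the standard-library `xml.etree.ElementTree` module supports only a limited XPath subset and requires well-formed markup, so a real crawler would normally use an HTML-tolerant parser instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment used only for illustration.
PAGE = """
<html>
  <body>
    <div id="forecast">
      <h1>Geomagnetic summary</h1>
      <span class="kp">Kp index: 3</span>
    </div>
  </body>
</html>
"""

def extract_text(page: str, xpath: str) -> list[str]:
    """Return the stripped text of every node matched by a (limited) XPath expression."""
    root = ET.fromstring(page.strip())  # build the element tree for the document
    return [(node.text or "").strip() for node in root.findall(xpath)]

# Locate a node by its position in the tree and its attributes.
print(extract_text(PAGE, ".//div[@id='forecast']/span[@class='kp']"))
```

The path expression plays the role of the acquisition rule: changing the expression, not the code, retargets the extraction to a different node.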
Disclosure of Invention
The invention aims to provide a web crawler-based open-source electric wave environment data acquisition method and a computer-readable storage medium.
The invention adopts the following technical scheme:
the improvement of an open-source electric wave environment data acquisition method based on web crawlers is that the method comprises the following steps:
step 1, open source data investigation and webpage characteristic analysis:
step 11, investigating and analyzing data related to the open-source electric wave environment: a plurality of electric wave environment data sources are investigated and screened, data sources that can be used for, and are highly relevant to, electric wave environment situation analysis are preferentially selected, their data type, acquisition means, acquisition time and update frequency are analyzed in depth, the quality of each electric wave environment information source is analyzed with the aid of an internet evaluation tool, and data sources that are highly relevant to the electric wave environment and highly reliable are included among the targets of electric wave environment web crawler data acquisition;
step 12, studying the distribution characteristics and change patterns of open-source electric wave environment web pages: the collected open-source electric wave environment websites are analyzed one by one, with statistical analysis focused on the structure of the data to be collected, whether login and a verification code are required, and whether AJAX is used; the websites are then classified and graded, and a data crawling strategy is formulated for each grade of website;
step 2, designing a web crawler:
step 21, customizing acquisition rules: the electric wave environment data to be acquired is classified according to its layout on the web page, and acquisition rules are configured one by one for the different layout styles of website data, namely single data, list data, table data and multi-page (paginated) data, to form acquisition tasks;
step 22, designing acquisition rules for special websites: for websites that require login and a verification code, the rule is configured in browser mode by remembering the Cookie, or by entering the account, entering the password and clicking login, to form the acquisition task; for websites that use AJAX, the rule is configured by setting an AJAX timeout to form the acquisition task;
step 23, extracting and storing data: data extraction is refined by optimizing the configured acquisition rules, including adding special fields, adjusting field positions, merging fields, formatting fields and applying regular expressions, to produce data that meets the final requirements, and the data is periodically imported into a database through data field mapping;
step 3, automatic acquisition system architecture and integrated application:
step 31, designing a distributed cloud-architecture acquisition system: the whole system is deployed on a cloud platform and comprises a main program, a monitoring program, an acquisition rule configuration client, a distributed acquisition cluster and a storage cluster; the main program deploys services including connections to the client and the cloud nodes and storage of cloud-acquired data and account information; the monitoring program provides service resource management, node resource management, and task control and monitoring; the acquisition rule configuration client builds the acquisition rule flow visually by simulating manual web browsing operations, and the acquisition rule configuration data is stored in a configuration database;
step 32, integrated application in the electric wave environment application system: an electric wave environment application system deployed directly on the internet is seamlessly interfaced with the electric wave environment web crawler acquisition system, with sampling frequencies that can be set in minutes, hours, days, weeks or months; for an electric wave environment application system deployed across networks, the electric wave environment web crawler acquisition system supports export in multiple data formats including Excel, SQL, TXT and MySQL, so that only periodic data import, or a data integration interface program, is needed.
Further, in step 21, single data is collected as follows:
(a) creating a new task, entering the website address, and selecting each field to be collected; for a text field, choosing to collect the text of the element, and for a picture field, choosing to collect the picture address;
(b) editing the fields: renaming a field or performing further operations on it, including deletion, copying and formatting;
(c) collecting and exporting the data: running the configured acquisition task and, after acquisition finishes, choosing a suitable export format; Excel, CSV and HTML are supported.
Further, in step 21, list data is collected as follows:
(a) selecting a data list on the page; the selection must be as large as possible so that it contains all fields to be collected;
(b) choosing 'select sub-elements' for the data list; the interface indicates that the sub-elements are selected and that similar elements have been found, whereupon 'select all' is chosen;
(c) after confirming that all fields are selected, starting data acquisition, which completes the 'loop-extract data' setup.
Further, in step 21, table data is collected as follows:
(a) selecting the first cell of the table on the page and choosing 'expand selection' in the prompt box until the selection covers the whole row, whereupon the specific fields of each row are located and identified automatically;
(b) choosing 'select sub-elements'; the interface indicates that similar elements have been found, whereupon 'select all' is chosen so that all table sub-elements on the page are selected;
(c) by default the current rule collects all fields in the table; fields are modified or deleted in the field editing interface as needed, and data acquisition is then started.
Further, in step 21, multi-page (paginated) data is collected as follows:
(a) first establishing a 'loop-extract data' process to collect the list data on the first page;
(b) finding and clicking 'next page' on the page, then choosing 'loop click next page' in the prompt box to establish the page-turning loop;
(c) starting data acquisition, which completes the acquisition rule for multi-page data.
In a computer-readable storage medium having stored thereon a radio wave environment web crawler collection system program, the improvement comprising: when being executed by a processor, the electric wave environment web crawler acquisition system program realizes the steps of the open-source electric wave environment data acquisition method based on the web crawler.
The invention has the beneficial effects that:
The disclosed method investigates and screens a variety of data sources related to the electric wave environment, analyzes in depth information such as data type, acquisition means, acquisition time and update frequency, and includes data sources that are highly relevant to the electric wave environment and highly reliable among the targets of electric wave environment web crawler data acquisition. The open-source websites are analyzed one by one, classified and graded, and a data crawling strategy is formulated for each grade of website. Based on how open-source website data is laid out on the web page, different acquisition rules are designed for single data, list data, table data, multi-page data and so on, while special rules handle the more difficult data sources, such as those that require login and verification or use AJAX. On this basis, a distributed cloud-architecture acquisition system for the electric wave environment web crawler is built, which performs web page analysis, data acquisition, data cleaning and data storage for the fragmented information of open-source electric wave environment websites in an integrated way. For electric wave environment application systems deployed in the same network and across networks, fully automatic and semi-automatic interfacing schemes are designed respectively, enabling unattended operation and efficient operation and maintenance of the application systems and laying a foundation for improving electric wave environment information support capability in the big data era.
The disclosed method uses the data acquired by the electric wave environment web crawler acquisition system as an effective supplement to the observation data of the national defense science and technology industry radio wave observation station network. It extends environmental support to areas outside the coverage of the station network and addresses the station network's relatively weak capability to observe, forecast and support services for the electric wave environment of military hot-spot areas around China. Data products built on these research results can serve users in fields such as communication/broadcasting, detection, navigation and positioning, and remote sensing, providing data support and application assurance for the planning and demonstration, design and development, and service application of related electronic information systems.
Drawings
FIG. 1 is a schematic flow chart of the method disclosed in embodiment 1 of the present invention;
FIG. 2 is a technical route diagram of the method disclosed in embodiment 1 of the present invention;
FIG. 3 is a schematic diagram of the single data collection rule and collection field configuration in the method disclosed in embodiment 1 of the present invention;
FIG. 4 is a schematic diagram of the list data collection rule and collection field configuration in the method disclosed in embodiment 1 of the present invention;
FIG. 5 is a schematic diagram of the loop collection rule configuration for table data in the method disclosed in embodiment 1 of the present invention;
FIG. 6 is a schematic diagram of the table data collection field configuration in the method disclosed in embodiment 1 of the present invention;
FIG. 7 is a schematic diagram of the multi-page data collection rule configuration in the method disclosed in embodiment 1 of the present invention;
FIG. 8 is a schematic diagram of data being collected by the method disclosed in embodiment 1 of the present invention;
FIG. 9 is a schematic diagram of the AJAX data-loading configuration in the method disclosed in embodiment 1 of the present invention;
FIG. 10 is a schematic diagram of the data export configuration in the method disclosed in embodiment 1 of the present invention;
FIG. 11 is a schematic diagram of the system architecture of the integrated application of the electric wave environment web crawler acquisition system and the electric wave environment application system in the method disclosed in embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Embodiment 1: this embodiment discloses a web crawler-based open-source electric wave environment data acquisition method which, as shown in FIGS. 1 and 2, comprises the following steps:
step 1, open source data investigation and webpage characteristic analysis:
step 11, investigating and analyzing data related to the open-source electric wave environment: a plurality of electric wave environment data sources are investigated and screened, data sources that can be used for, and are highly relevant to, electric wave environment situation analysis are preferentially selected, information such as their data type, acquisition means, acquisition time and update frequency is analyzed in depth, the quality of each electric wave environment information source is analyzed with the aid of an internet evaluation tool, and data sources that are highly relevant to the electric wave environment and highly reliable are included among the targets of electric wave environment web crawler data acquisition;
step 12, studying the distribution characteristics and change patterns of open-source electric wave environment web pages: the dozens of collected open-source electric wave environment websites are analyzed one by one, with statistical analysis focused on the aspects that affect the design and implementation of a web crawler, such as the structure of the data to be collected, whether login and a verification code are required, and whether AJAX is used; the websites are then classified and graded, and a data crawling strategy is formulated for each grade of website;
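The classification and grading in step 12 can be pictured with a minimal Python sketch. The patent does not specify a grading scale, so the numeric grade, the `SiteProfile` fields and the strategy wording below are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SiteProfile:
    """Hypothetical summary of the features surveyed for one website."""
    name: str
    needs_login: bool
    needs_captcha: bool
    uses_ajax: bool

def grade(site: SiteProfile) -> int:
    """Illustrative difficulty grade: 1 for a plain static site, plus 1 per obstacle."""
    return 1 + int(site.needs_login) + int(site.needs_captcha) + int(site.uses_ajax)

def strategy(site: SiteProfile) -> list[str]:
    """Assemble the crawling steps implied by the surveyed features."""
    steps = []
    if site.needs_login:
        steps.append("reuse a saved cookie or submit account and password")
    if site.needs_captcha:
        steps.append("hand the page to a verification-code recognition step")
    if site.uses_ajax:
        steps.append("wait for AJAX content with a timeout")
    steps.append("apply the layout-specific acquisition rule")
    return steps

site = SiteProfile("example-site", needs_login=True, needs_captcha=False, uses_ajax=True)
print(grade(site), strategy(site))
```

Keeping the survey result as data makes it straightforward to re-grade a site when its page structure changes.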
step 2, designing a web crawler:
step 21, customizing acquisition rules: the electric wave environment data to be acquired is classified according to its layout on the web page, and acquisition rules are configured one by one for the different layout styles of website data, such as single data, list data, table data and multi-page data, to form acquisition tasks;
As shown in FIG. 3, single data is collected. Taking the data at http://stereo-ssc.nascom.nasa.gov/beacon/beacon_insitu.shtml as an example, the web page contains many fields: text (IMPACT/PLASTIC solar wind data, IMPACT solar electronic data, SWAVES radio data), picture (the data corresponding to the picture), time, and so on. These fields are collected and stored as structured data. The main operation flow is as follows:
(a) creating a new task, entering the website address, and selecting each field to be collected; for a text field, choosing to collect the text of the element, and for a picture field, choosing to collect the picture address;
(b) editing the fields: renaming a field or performing further operations on it, including deletion, copying and formatting;
(c) collecting and exporting the data: running the configured acquisition task and, after acquisition finishes, choosing a suitable export format; Excel, CSV and HTML are supported.
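The single-data flow above is carried out in a visual tool, but a rough programmatic analogue can be sketched with the Python standard library. The page fragment, the element ids (`title`, `updated`) and the image path are hypothetical placeholders, not content from the site named above.

```python
from html.parser import HTMLParser

class SingleRecordParser(HTMLParser):
    """Collect the text of elements with chosen ids and the src of every <img>."""

    def __init__(self, text_ids):
        super().__init__()
        self.text_ids = set(text_ids)
        self.record = {"images": []}
        self._current = None  # id of the element whose text is being captured

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "img" and attrs.get("src"):
            self.record["images"].append(attrs["src"])  # picture field: keep the address
        elif attrs.get("id") in self.text_ids:
            self._current = attrs["id"]                  # text field: start capturing
            self.record[self._current] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.record[self._current] += data

    def handle_endtag(self, tag):
        # Simplification: capture stops at the first closing tag after the start tag.
        self._current = None

# Hypothetical page fragment with two text fields and one picture field.
PAGE = """
<p id="title">SWAVES radio data</p>
<p id="updated">placeholder timestamp</p>
<img src="/images/latest_plot.gif">
"""

def collect_single(page, text_ids):
    parser = SingleRecordParser(text_ids)
    parser.feed(page)
    parser.close()
    return {k: v.strip() if isinstance(v, str) else v for k, v in parser.record.items()}

print(collect_single(PAGE, ["title", "updated"]))
```

Each selected id corresponds to one "field" of the task; renaming or dropping fields afterwards is then an ordinary dictionary operation.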
As shown in FIG. 4, list data is collected. Taking the data at https://ui.adsabs.harvard.edu/abs/2004A%26A...425.1097R/coordinates as an example, the page contains many structurally identical list entries, each with the same fields: serial number, article number, time, article title, author, and so on. The fields of each entry are collected in the order in which the entries appear on the page; the key task is to establish a 'loop-extract data' process.
(a) selecting a data list on the page; the selection must be as large as possible so that it contains all fields to be collected;
(b) choosing 'select sub-elements' for the data list; the interface indicates that the sub-elements are selected and that similar elements have been found, whereupon 'select all' is chosen;
(c) after confirming that all fields are selected, starting data acquisition, which completes the 'loop-extract data' setup.
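The 'loop-extract data' idea for list data, iterating over structurally identical entries and pulling the same fields from each, can be sketched as follows. The markup, class names and field paths are hypothetical, and the sketch assumes well-formed markup because it relies on the standard-library `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical list markup: every entry has the same sub-elements.
LIST_PAGE = """
<ul id="results">
  <li class="item"><span class="no">1</span><span class="title">Paper A</span><span class="author">Author X</span></li>
  <li class="item"><span class="no">2</span><span class="title">Paper B</span><span class="author">Author Y</span></li>
</ul>
"""

# Field name -> path relative to one list entry (the "sub-elements" of the entry).
FIELDS = {
    "number": "span[@class='no']",
    "title": "span[@class='title']",
    "author": "span[@class='author']",
}

def extract_list(page, item_xpath, fields):
    """Loop over all entries matched by item_xpath and extract the same fields from each."""
    root = ET.fromstring(page.strip())
    rows = []
    for item in root.findall(item_xpath):           # the loop
        row = {}
        for name, path in fields.items():           # the extraction
            row[name] = (item.findtext(path) or "").strip()
        rows.append(row)
    return rows

rows = extract_list(LIST_PAGE, ".//li[@class='item']", FIELDS)
print(rows)
```

The outer path corresponds to selecting the whole list and its similar elements; the inner paths correspond to the fields confirmed before acquisition starts.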
As shown in FIGS. 5 and 6, table data is collected. Taking the data at https://cdaw.gsfc.nasa.gov/CME_list/UNIVERSAL/1996_01/univ1996_01.html as an example, the web page contains a table with a very regular structure: each record occupies one row, and each row contains multiple fields: First C2 Appearance Date Time, Central PA, Angular Width, Linear Speed, 2nd-order Speed at final height, 2nd-order Speed at 20 Rs, Accel, Mass, Kinetic Energy, MPA, and so on. These fields are collected and stored in a format such as Excel; here too, the key task is to establish a 'loop-extract data' process.
(a) selecting the first cell of the table on the page and choosing 'expand selection' in the prompt box until the selection covers the whole row, whereupon the specific fields of each row are located and identified automatically;
(b) choosing 'select sub-elements'; the interface indicates that similar elements have been found, whereupon 'select all' is chosen so that all table sub-elements on the page are selected;
(c) by default the current rule collects all fields in the table; fields are modified or deleted in the field editing interface as needed, and data acquisition is then started.
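For table data, the same loop runs over rows, with the header row supplying the field names. The following is a minimal sketch under stated assumptions: the table is a hypothetical, well-formed fragment with placeholder cell values (not data from the catalog above), and the `drop` parameter stands in for deleting unwanted fields in the field editing step.

```python
import xml.etree.ElementTree as ET

# Hypothetical table fragment; cell values are placeholders, not real measurements.
TABLE_PAGE = """
<table>
  <tr><th>Date Time</th><th>Central PA</th><th>Angular Width</th><th>Linear Speed</th></tr>
  <tr><td>T1</td><td>PA1</td><td>W1</td><td>V1</td></tr>
  <tr><td>T2</td><td>PA2</td><td>W2</td><td>V2</td></tr>
</table>
"""

def extract_table(page, drop=()):
    """Turn each data row into a dict keyed by the header cells, minus dropped fields."""
    root = ET.fromstring(page.strip())
    rows = root.findall(".//tr")
    header = [(cell.text or "").strip() for cell in rows[0].findall("th")]
    records = []
    for tr in rows[1:]:
        cells = [(cell.text or "").strip() for cell in tr.findall("td")]
        record = dict(zip(header, cells))   # by default every column is collected
        for name in drop:                   # field editing: remove unwanted columns
            record.pop(name, None)
        records.append(record)
    return records

records = extract_table(TABLE_PAGE, drop=("Central PA",))
print(records)
```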
As shown in FIG. 7, multi-page (paginated) data is collected. Again taking the data at https://ui.adsabs.harvard.edu/abs/2004A%26A...425.1097R/locations as an example, each single page is list data, but the pages must be turned to collect all the data. The main operation flow is as follows:
(a) first establishing a 'loop-extract data' process to collect the list data on the first page;
(b) finding and clicking 'next page' on the page, then choosing 'loop click next page' in the prompt box to establish the page-turning loop;
(c) starting data acquisition, which completes the acquisition rule for multi-page data.
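The page-turning loop wraps the per-page extraction in an outer loop that follows the "next page" link until none remains. In the sketch below the `fetch`, `extract_rows` and `next_url` callables and the in-memory `PAGES` dictionary are hypothetical stand-ins for real network access and parsing; the `seen` set and `max_pages` limit are added safeguards, not features described in the patent.

```python
def crawl_pages(fetch, first_url, extract_rows, next_url, max_pages=100):
    """Collect rows from first_url and every following page.

    fetch(url) returns a page, extract_rows(page) returns that page's rows,
    and next_url(page) returns the next page's URL or None on the last page.
    """
    rows, url, seen = [], first_url, set()
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)                     # guards against cyclic "next" links
        page = fetch(url)
        rows.extend(extract_rows(page))   # inner loop: extract data on this page
        url = next_url(page)              # outer loop: click "next page"
    return rows

# In-memory stand-in for a three-page listing.
PAGES = {
    "page1": {"rows": ["r1", "r2"], "next": "page2"},
    "page2": {"rows": ["r3"], "next": "page3"},
    "page3": {"rows": ["r4", "r5"], "next": None},
}

result = crawl_pages(PAGES.__getitem__, "page1",
                     lambda page: page["rows"], lambda page: page["next"])
print(result)  # ['r1', 'r2', 'r3', 'r4', 'r5']
```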
Step 22, designing acquisition rules for special websites: for a website that requires login and a verification code, the rule is configured in browser mode by remembering the Cookie, or by entering the account, entering the password and clicking login; a few websites also require a verification code, for which a verification code recognition control is used so that the code is handled automatically. For a website that uses AJAX, AJAX settings are required: an AJAX timeout is set, and acquisition rules are then configured one by one for the different layout styles of website data, such as single data, list data, table data and multi-page data, to form acquisition tasks. FIG. 8 is a schematic diagram of data being collected; FIG. 9 is a schematic diagram of the AJAX data-loading configuration;
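The AJAX timeout can be understood as "poll until the asynchronously loaded content appears, but give up after a fixed time". The helper below is a generic sketch of that idea, not the configuration mechanism of any particular tool; the name `wait_for` and its parameters are hypothetical, and the `clock` and `sleep` arguments exist only so the behaviour can be exercised without real delays.

```python
import time

def wait_for(condition, timeout, interval=0.5, clock=time.monotonic, sleep=time.sleep):
    """Poll condition() until it returns a truthy value or timeout seconds elapse."""
    deadline = clock() + timeout
    while True:
        value = condition()
        if value:
            return value                  # content has loaded
        if clock() >= deadline:
            raise TimeoutError(f"content not loaded within {timeout} s")
        sleep(interval)                   # wait before checking again

# Stand-in for a page whose table only appears on the third check.
responses = iter([None, None, "<table>...</table>"])
print(wait_for(lambda: next(responses), timeout=5, interval=0.01))
```

A timeout that is too short drops slow pages, while one that is too long stalls the whole task on a page that never finishes loading, which is why it is set per website.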
step 23, extracting and storing data: while the acquisition rules are being configured they are also optimized, including adding special fields, adjusting field positions, merging fields, formatting fields and applying regular expressions, to produce data that meets the final requirements; the data can be periodically imported into a database through data field mapping.
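The post-processing and field-mapped import in step 23 can be sketched as follows. Everything specific here is an assumption made for illustration: the raw field names, the sample values, the target column names in `FIELD_MAP`, the `cme` table, and the use of an in-memory SQLite database in place of whatever database a deployment would use.

```python
import re
import sqlite3

def clean_record(raw, source):
    """Illustrative post-processing: merge, format, regex-extract, and add a field."""
    speed = re.search(r"-?\d+(?:\.\d+)?", raw["Linear Speed"])
    return {
        "observed_at": f'{raw["Date"].replace("/", "-")} {raw["Time"]}',  # merge + format
        "linear_speed": float(speed.group()) if speed else None,           # regular expression
        "source": source,                                                  # added special field
    }

# Collected field name -> database column name (hypothetical mapping).
FIELD_MAP = {"observed_at": "obs_time", "linear_speed": "speed_km_s", "source": "src"}

def store(conn, table, records, field_map):
    """Insert records using the field mapping; table and column names are trusted config."""
    columns = [field_map[key] for key in field_map]
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    conn.executemany(sql, [[rec[key] for key in field_map] for rec in records])
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cme (obs_time TEXT, speed_km_s REAL, src TEXT)")
raw = {"Date": "2000/01/02", "Time": "03:04:05", "Linear Speed": "123.5 km/s"}  # placeholder values
store(conn, "cme", [clean_record(raw, "example-site")], FIELD_MAP)
print(conn.execute("SELECT * FROM cme").fetchall())
```

Running `store` on a schedule gives the periodic import; the mapping keeps the crawler's field names independent of the database schema.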
Step 3, automatic acquisition system architecture and integrated application:
step 31, designing a distributed cloud-architecture acquisition system: the whole system is deployed on a cloud platform and comprises a main program, a monitoring program, an acquisition rule configuration client, a distributed acquisition cluster and a storage cluster; the main program deploys multiple services such as connections to the client and the cloud nodes and storage of cloud-acquired data and account information; the monitoring program provides functions such as service resource management, node resource management, and task control and monitoring; the acquisition rule configuration client builds the acquisition rule flow visually by simulating manual web browsing operations, and the acquisition rule configuration data is stored in a configuration database.
Step 32, integrated application in the electric wave environment application system: fully automatic and semi-automatic interfacing with the electric wave environment web crawler acquisition system is designed for electric wave environment application systems deployed in the same network and across networks, respectively. Specifically, an electric wave environment application system deployed directly on the internet is seamlessly interfaced with the electric wave environment web crawler acquisition system; sampling frequencies such as minutes, hours, days, weeks and months can be set, data is acquired and stored automatically, and the application system runs unattended. For an electric wave environment application system deployed across networks, the electric wave environment web crawler acquisition system supports export in multiple data formats such as Excel, SQL, TXT and MySQL, so the user can operate and maintain the application system efficiently simply by importing data periodically or by writing a simple data integration interface program. FIG. 10 is a schematic diagram of the data export configuration; FIG. 11 is a schematic diagram of the system architecture of the integrated application of the electric wave environment web crawler acquisition system and the electric wave environment application system.
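For the cross-network case, exporting the same records in several formats can be sketched as below. The functions cover only CSV, tab-separated text and SQL INSERT statements; the record fields and table name are hypothetical, and the SQL writer quotes every value as a string for simplicity, which a real export would refine per column type.

```python
import csv
import io

def to_csv(records):
    """Render records as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()

def to_txt(records, sep="\t"):
    """Render records as separator-delimited text, one record per line."""
    return "\n".join(sep.join(str(v) for v in rec.values()) for rec in records) + "\n"

def to_sql(records, table):
    """Render records as INSERT statements (all values quoted as strings)."""
    lines = []
    for rec in records:
        cols = ", ".join(rec)
        vals = ", ".join("'" + str(v).replace("'", "''") + "'" for v in rec.values())
        lines.append(f"INSERT INTO {table} ({cols}) VALUES ({vals});")
    return "\n".join(lines) + "\n"

records = [{"station": "A", "value": 1}, {"station": "O'B", "value": 2}]  # placeholder data
print(to_csv(records))
print(to_sql(records, "observations"))
```

Files produced this way can be carried across the network boundary and loaded by a periodic import job or a small integration program on the application side.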
This embodiment also discloses a computer-readable storage medium on which a radio wave environment web crawler acquisition system program is stored; when the program is executed by a processor, it implements the steps of the open-source radio wave environment data acquisition method based on a web crawler.

Claims (6)

1. An open-source radio wave environment data acquisition method based on a web crawler, characterized by comprising the following steps:
step 1, open-source data investigation and web page characteristic analysis:
step 11, investigating and analyzing open-source data related to the radio wave environment: surveying and screening a plurality of radio wave environment related data sources; preferentially selecting the data sources that can be used for, and are highly relevant to, radio wave environment situation analysis; analyzing in depth their data type, acquisition means, acquisition time and update frequency; analyzing the quality of the radio wave environment information sources by means of Internet evaluation tools; and including the data sources that are highly relevant to the radio wave environment and highly reliable as targets of radio wave environment web crawler data acquisition;
step 12, studying the distribution characteristics and change patterns of open-source radio wave environment web pages: analyzing the collected radio wave environment open-source websites one by one, with the statistical analysis focused on the structure of the data to be acquired, whether login and a verification code are required, and whether Ajax technology is used; completing the classification and grading of the radio wave environment open-source websites; and formulating data crawling strategies for websites of different grades;
step 2, web crawler design:
step 21, customizing acquisition rules: classifying the radio wave environment data to be acquired according to their layout on the web page, and configuring acquisition rules one by one for website data of different design styles, namely single data, list data, table data and page-turning multi-page data, to form acquisition tasks;
step 22, designing acquisition rules for special websites: configuring the rules in browser mode, and remembering the Cookie, or entering an account, entering a password and clicking login, to form acquisition tasks for websites that require login and a verification code; and configuring the rules by setting an Ajax timeout, to form acquisition tasks for websites that use Ajax technology;
step 23, data extraction and storage: data extraction is achieved by optimizing the configured acquisition rules, including adding special fields, adjusting field positions, merging fields, formatting fields and applying regular expressions, so as to form data that finally meet the requirements, and the data are periodically imported into a database through data field mapping;
step 3, automatic acquisition system architecture and integrated application:
step 31, designing an acquisition system with a distributed cloud architecture: the whole system is deployed on a cloud platform and comprises a main program, a monitoring program, an acquisition rule configuration client, a distributed acquisition cluster and a storage cluster; the main program deploys and connects services including the client, the cloud nodes, the cloud acquisition data and account information storage; the monitoring program provides service resource management, node resource management, and task control and monitoring; the acquisition rule configuration client builds the acquisition rule workflow visually by simulating manual web browsing operations, and the acquisition rule configuration data are stored in a configuration database;
step 32, integrated application in the radio wave environment application system: a radio wave environment application system deployed directly on the Internet is seamlessly interfaced with the radio wave environment web crawler acquisition system, and sampling frequencies including minutes, hours, days, weeks and months are set; for a radio wave environment application system deployed across networks, the radio wave environment web crawler acquisition system supports export in multiple data formats including Excel, SQL, TXT and MySQL, and it is only necessary to import data periodically or to write a data integration interface program.
2. The open-source radio wave environment data acquisition method based on a web crawler according to claim 1, characterized in that, in step 21, the steps of acquiring single data are:
(a) creating a new task and entering the website address; selecting each field to be acquired; for a text field, choosing to acquire the text of the element, and for a picture field, choosing to acquire the picture address;
(b) editing the fields: modifying field names and performing further operations on the fields, including deleting, copying and formatting;
(c) acquiring and exporting the data: running the configured acquisition task and, after acquisition is finished, selecting a suitable export mode to export the data, with Excel, CSV and HTML supported.
3. The open-source radio wave environment data acquisition method based on a web crawler according to claim 1, characterized in that, in step 21, the steps of acquiring list data are:
(a) selecting the data list on the page, where the selected range must be the largest one and contain all fields to be acquired;
(b) selecting the child elements of the data list, whereupon the interface indicates that the child elements are selected and that similar elements have been found, and choosing 'select all';
(c) after confirming that all fields are selected, starting data acquisition, which completes the 'loop - extract data' setting.
4. The open-source radio wave environment data acquisition method based on a web crawler according to claim 1, characterized in that, in step 21, the steps of acquiring table data are:
(a) selecting the first cell of the first list on the page and choosing 'expand selection' in the prompt box to extend the selection to the whole row, whereupon the individual fields of each row are located and identified automatically;
(b) choosing 'select child elements', whereupon the interface indicates that similar elements have been found, and choosing 'select all' to select all table child elements on the page;
(c) the current rule acquires all fields in the list by default; fields are modified or deleted in the field editing interface, and data acquisition is started.
5. The open-source radio wave environment data acquisition method based on a web crawler according to claim 1, characterized in that, in step 21, the steps of acquiring page-turning multi-page data are:
(a) first establishing a 'loop - extract data' workflow to complete acquisition of the list data on the first page;
(b) finding and clicking 'next page' on the page, then choosing 'loop click next page' in the prompt box to establish the page-turning loop;
(c) starting data acquisition, which completes the formulation of the acquisition rule for page-turning multi-page data.
6. A computer-readable storage medium on which a radio wave environment web crawler acquisition system program is stored, characterized in that, when executed by a processor, the program implements the steps of the open-source radio wave environment data acquisition method based on a web crawler according to any one of claims 1 to 5.
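For readers who think in code, the page-turning loop of claim 5 and the extraction and field mapping of step 23 might be sketched in Python as follows. The claims describe a visual, click-driven configuration, so this is only an analogy: the `fetch_page` callable, the `rows` and `next` keys, the `max_pages` limit and the column names are all hypothetical and do not come from the patent.

```python
import re

# Matches the first integer or decimal number in a string.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def collect_all_pages(fetch_page, start_url, max_pages=100):
    """Extract rows page by page, following 'next' links until none remain.

    fetch_page is a hypothetical callable that returns a dict of the form
    {"rows": [...], "next": url_or_None} for a given URL.
    """
    records, url, seen = [], start_url, set()
    # The seen set and max_pages guard against endless page-turning loops.
    while url is not None and url not in seen and len(seen) < max_pages:
        seen.add(url)
        page = fetch_page(url)
        records.extend(page["rows"])
        url = page.get("next")
    return records


def extract_number(text):
    """Format a field with a regular expression: keep only the numeric part."""
    match = NUMBER.search(text)
    return float(match.group()) if match else None


def to_db_row(record, field_map, source):
    """Map acquired string fields to database columns and add a special field."""
    row = {"source": source}  # added special field recording the data source
    for field_name, column in field_map.items():
        row[column] = record.get(field_name, "").strip()
    return row
```

Keeping the loop separate from the per-record mapping mirrors the split in the claims between forming the acquisition task and optimizing the extracted fields before they are imported into the database.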
CN202110522150.9A 2021-05-13 2021-05-13 Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium Active CN113297448B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110522150.9A CN113297448B (en) 2021-05-13 2021-05-13 Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110522150.9A CN113297448B (en) 2021-05-13 2021-05-13 Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN113297448A (en) 2021-08-24
CN113297448B (en) 2022-10-25

Family

ID=77321979

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110522150.9A Active CN113297448B (en) 2021-05-13 2021-05-13 Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN113297448B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060206584A1 (en) * 2005-03-11 2006-09-14 Yahoo! Inc. System and method for listing data acquisition
CN106134454B (en) * 2011-05-10 2014-08-27 中国电子科技集团公司第二十二研究所 Electric wave environment comprehensive monitoring warning device
CN104135516A (en) * 2014-07-29 2014-11-05 浪潮软件集团有限公司 Distributed cloud storage method based on industry data acquisition
CN106055618A (en) * 2016-05-26 2016-10-26 优品财富管理有限公司 Data processing method based on web crawlers and structural storage
CN107577748A (en) * 2017-08-30 2018-01-12 成都中建科联网络科技有限公司 Building trade information acquisition system and its method based on big data
CN109543086A (en) * 2018-11-23 2019-03-29 北京信息科技大学 A kind of network data acquisition and methods of exhibiting towards multi-data source
CN112417239A (en) * 2019-08-21 2021-02-26 京东方科技集团股份有限公司 Webpage data crawling method and device
CN110765402A (en) * 2019-10-31 2020-02-07 同方知网(北京)技术有限公司 Visual acquisition system and method based on network resources
CN112148807A (en) * 2020-09-28 2020-12-29 中国电波传播研究所(中国电子科技集团公司第二十二研究所) Electromagnetic environment field data warehouse construction method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
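Peng Fan: "Design and Implementation of the Management Subsystem in a Cloud Data Acquisition System", China Excellent Doctoral and Master's Theses Full-text Database (Master's), Information Science and Technology *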

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114860875A (en) * 2022-04-26 2022-08-05 深圳市生态环境智能管控中心 Data integration system and method for fixed pollution source
CN114860875B (en) * 2022-04-26 2023-06-20 深圳市生态环境智能管控中心 Data integration system and method for fixed pollution source

Also Published As

Publication number Publication date
CN113297448B (en) 2022-10-25

Similar Documents

Publication Publication Date Title
CN109543086B (en) Network data acquisition and display method oriented to multiple data sources
CN103092950B (en) A kind of network public-opinion geographic position real-time monitoring system and method
Flemons et al. A web-based GIS tool for exploring the world's biodiversity: The Global Biodiversity Information Facility Mapping and Analysis Portal Application (GBIF-MAPA)
Du et al. Research development on sustainable urban infrastructure from 1991 to 2017: a bibliometric analysis to inform future innovations
CN103577581B (en) Agricultural product price trend forecasting method
CN1963816A (en) Automatization processing method of rating of merit of search engine
CN102722558A (en) User question recommending method and device
CN101794277B (en) Method for embedding geographical labels in network character information and system
CN101639852A (en) Method and system for sharing distributed geoscience data
Liu et al. Visualized analysis of knowledge development in green building based on bibliographic data mining
CN113297448B (en) Open-source electric wave environment data acquisition method based on web crawler and computer readable storage medium
Li et al. Retrospective research on the interactions between land-cover change and global warming using bibliometrics during 1991–2018
CN101105797A (en) Form locating data mining method
CN102156749A (en) Anatomic search and judgment method, system and distributed server system for map sites
Medyckyj-Scott et al. Discovering environmental data: metadatabases, network information resource tools and the GENIE system
Yang et al. A topic-specific web crawler with concept similarity context graph based on FCA
Khan et al. Self-adaptive ontology-based focused crawling: a literature survey
Yang Research on the Construction of Intelligent Learning System Based on Big Data
Xiong et al. Automated construction technology of the government agencies knowledge graph based on the topical crawler
He et al. A study on evaluation of farmland fertility levels based on optimization of the decision tree algorithm
Jayawardana et al. Modeling updates of scholarly webpages using archived data
Wang et al. The Spatial Distribution Dataset on Ecological Agriculture Patterns of China (2018–2020)
Djajadi et al. Cleansing and Transforming Solarmanpv Datasets for Visually Interactive Disclosure of Covert Information of Solar Power Plants Worldwide
Liang et al. The Design and Development of the Land Management System in Dingzhuang Town Based on Spatial Data
Shi et al. The design and implementation of opinion extraction system based on distributed network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant