CN111859067A - Hydrological water quality data acquisition method and system based on web crawler technology - Google Patents

Hydrological water quality data acquisition method and system based on web crawler technology

Publication number
CN111859067A
Authority
CN
China
Prior art keywords
data
water quality
hydrological
module
site
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010613400.5A
Other languages
Chinese (zh)
Inventor
谢天奕
王永桂
李强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences
Priority to CN202010613400.5A
Publication of CN111859067A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80 Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84 Mapping; Conversion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a hydrological and water quality data acquisition system and method based on web crawler technology. The system and method monitor the water quality and hydrological data on a target website in real time through a web page monitoring and alert plug-in. During monitoring, web crawler technology is used: for the monitored water quality data, a JSON file is crawled for subsequent water quality analysis, and the encoded JSON string in the file is decoded into a Python object to obtain the water quality data; for the monitored hydrological data, an XML file containing the web page content is crawled, a node or node set in the XML file is selected with XPath syntax, and the web page content is parsed to obtain the hydrological data. After a connection session between the server and the user is established, the hydrological and water quality data collected by the data acquisition module are retrieved according to preset retrieval conditions, analysis conditions are set based on the retrieval results, and the hydrological and water quality analysis results are output.

Description

Hydrological water quality data acquisition method and system based on web crawler technology
Technical Field
The invention relates to the field of data acquisition and analysis, in particular to a hydrological water quality data acquisition method and system based on a web crawler technology.
Background
As information systems of all kinds have come into use, a long-neglected problem has gradually surfaced: data acquisition. Data acquisition has become a bottleneck and weak point limiting the performance of information systems, and how to collect data quickly and effectively has become a focus of attention.
With the development of the Internet, people can obtain the knowledge they need from the vast amount of information online. Different users need different knowledge, which greatly increases the difficulty of obtaining target information. The web crawler emerged in response: it is highly specialized, can query large numbers of web pages efficiently, and can capture and store useful information from massive Internet data.
At present, web crawler technology is a tool for solving the data acquisition problem and is applied in various information systems. For example, the patent titled "A method for extracting data based on medical system crawlers" (publication No. CN111078976A) discloses crawling medical data, such as medical images, related pathological parameters, laboratory and measurement results, diagnosis records, and basic patient parameters (age, sex, medical history, time of hospital admission), from medical system databases using web crawler technology.
Hydrological and water quality data are extremely important to the field of water ecology and are essential basic data for water ecological environment protection planning, treatment, and restoration. However, on the one hand, the publishing websites release only real-time data and do not publish historical data, while long-series historical data are indispensable for scientific research and project planning. On the other hand, hydrological and water quality data are spread like a treasure house across water conservancy information networks at all levels throughout the country, yet few people collect and manage them specifically.
At present, although crawler technology is widely applied, no related technology discloses the acquisition and application of hydrological and water quality data using web crawler technology. Existing web crawler technologies or systems only crawl web data and lack functions for organizing the crawled data and providing services, so they are difficult to apply widely in the industry.
Disclosure of Invention
The technical problem to be solved by the invention is that the prior art lacks management of, and follow-up services for, crawled data; to address this, the invention provides a system and method that crawl hydrological and water quality monitoring data with web crawler technology and provide data services.
The technical solution adopted by the invention is a hydrological water quality data acquisition system based on web crawler technology, comprising the following modules:
A data acquisition module, which monitors the water quality and hydrological data on the target website in real time through a web page monitoring and alert plug-in. During monitoring, web crawler technology is used: for the monitored water quality data, a JSON file is crawled for subsequent water quality analysis, and the encoded JSON string in the file is decoded into a Python object to obtain the water quality data; for the monitored hydrological data, an XML file containing the web page content is crawled, a node or node set in the XML file is selected with XPath syntax, and the web page content is parsed to obtain the hydrological data;
A data processing module, which, after a connection session between the server and the user is established, retrieves the hydrological and water quality data collected by the data acquisition module according to preset retrieval conditions, sets analysis conditions based on the retrieval results, and outputs the hydrological and water quality analysis results.
The invention also discloses a hydrological water quality data acquisition method for the above system, comprising the following steps:
S1, under the data acquisition module, monitoring the water quality and hydrological data on the target website in real time through the web page monitoring and alert plug-in. During monitoring, web crawler technology is used: for the monitored water quality data, a JSON file is crawled for subsequent water quality analysis, and the encoded JSON string in the file is decoded into a Python object to obtain the water quality data; for the monitored hydrological data, an XML file containing the web page content is crawled, a node or node set in the XML file is selected with XPath syntax, and the web page content is parsed to obtain the hydrological data;
And S2, after the connection session between the server and the user is established, searching the hydrologic and water quality data collected in the data collection module according to preset searching conditions under the data processing module, setting analysis conditions based on the searching results, and outputting hydrologic and water quality analysis results.
In the method and system for acquiring hydrological water quality data based on web crawler technology, an online monitoring probe and a real-time deduplication algorithm are applied to the target to be crawled, so the dynamically changing data of hydrological and water quality websites can be crawled promptly and accurately, ensuring that the crawled data are complete, non-duplicated, and accurate. Data query and analysis services are provided on the basis of the crawled data, which is of practical significance to hydrological and water quality management and operation departments.
Implementing the hydrological water quality data acquisition method and system based on web crawler technology has the following beneficial effects:
1. A dynamic crawler tool adaptable to hydrological and water quality information publishing systems at different levels is formed using focused crawler technology, realizing active collection, automatic storage, and analysis of the information published by those systems.
2. Through the disclosed technical solution, data services are provided externally, and data-based analysis and display are supported, serving commercial applications of the system's tools.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic structural diagram of the hydrological water quality data acquisition system based on web crawler technology disclosed by the invention;
FIG. 2 is an execution flow chart of the hydrological water quality data acquisition method based on web crawler technology disclosed by the invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
Example 1:
Referring to FIG. 1, a schematic structural diagram of the hydrological water quality data acquisition system based on web crawler technology, the system includes a data acquisition module and a data processing module, wherein:
(1) The data acquisition module (the data crawling end shown in FIG. 1) monitors the water quality and hydrological data on the target website in real time through a web page monitoring and alert plug-in. During monitoring, web crawler technology is used: for the monitored water quality data, a JSON file is crawled for subsequent water quality analysis, and the encoded JSON string in the file is decoded into a Python object to obtain the water quality data;
For the monitored hydrological data, an XML file containing the web page content is crawled, a node or node set in the XML file is selected with XPath syntax, and the web page content is parsed to obtain the hydrological data.
It should be noted that the data crawling under the data acquisition module is implemented mainly by a web crawler module arranged in the data acquisition module (for the structure of the data acquisition module, see the data crawling part of FIG. 1);
In this embodiment, the data acquisition module further includes a web page monitor module, which checks the target website data at intervals (settable according to actual conditions) through the Distill Web Monitor browser plug-in, a web page monitoring and alert plug-in. The target websites include the national water and rainfall information website and the national surface water quality website, wherein:
For the national water and rainfall information website, the page does not change dynamically after loading, so the whole page can be monitored directly with the Distill Web Monitor plug-in;
For the national surface water quality website, a JavaScript scrolling-data display effect runs after the page loads; to avoid false detections, in this embodiment the Distill Web Monitor plug-in is used to eliminate the influence of the scrolling data on the monitoring result.
For the monitored water quality data, so that the web crawler module can correctly identify the target crawling site, in this embodiment the URL of the target website is first analyzed with a packet capture tool (depending on the operating system, usable tools include tcpdump, Wireshark, tcpflow, HttpWatch, or the browser's built-in network capture tool). The web crawler module then sends a POST request through the requests library in Python; for example, the code executed by the web crawler module is:
import requests
r = requests.post(url, data=data)  # url: URL of the target site; data: form data sent with the request
print(r.content)  # raw response body
After the JSON file for later water quality analysis is obtained (that is, a JSON string containing multiple water quality items), the encoded JSON string is decoded into a Python object with the json library to obtain the water quality data.
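The decoding step can be sketched as follows. This is a minimal illustration rather than the patent's actual code: the payload string and its field names (`site`, `measure_time`, `pH`, `dissolved_oxygen`) are hypothetical, since the source does not give the response format of the target site.

```python
import json

# Hypothetical response body; real field names depend on the target site.
payload = (
    '[{"site": "Station A", "measure_time": "2020-06-30 12:00", '
    '"pH": 7.8, "dissolved_oxygen": 8.1}]'
)


def decode_water_quality(text):
    """Decode a JSON string into a list of Python dicts."""
    records = json.loads(text)
    if isinstance(records, dict):  # tolerate a single-object payload
        records = [records]
    return records


records = decode_water_quality(payload)
for rec in records:
    print(rec["site"], rec["measure_time"], rec["pH"])
```

In practice the string passed to `decode_water_quality` would be the text of the POST response rather than a literal.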
For the monitored hydrological data, the web crawler module uses a Python + Selenium + Chrome crawler framework to obtain an XML file containing the web page content. The XML file contains a number of web page elements, and its content is parsed with XPath syntax to obtain the hydrological data.
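The XPath parsing step might look like the sketch below. It is an illustration under stated assumptions: the markup in `page` and the class names `station`, `level`, and `flow` are hypothetical, and the standard-library `xml.etree.ElementTree` (which supports only a subset of XPath) stands in for whatever XPath engine the authors used on the page source obtained through Selenium.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment; the real page structure is not given in the source.
page = """<table>
  <tr><td class="station">Station A</td><td class="level">23.45</td><td class="flow">1200</td></tr>
  <tr><td class="station">Station B</td><td class="level">18.02</td><td class="flow">860</td></tr>
</table>"""


def parse_hydrology(xml_text):
    """Select each table row with an XPath expression and read its cells."""
    root = ET.fromstring(xml_text)
    rows = []
    for tr in root.findall(".//tr"):  # node set: every <tr> below the root
        rows.append({
            "station": tr.find("td[@class='station']").text,
            "water_level": float(tr.find("td[@class='level']").text),
            "flow": float(tr.find("td[@class='flow']").text),
        })
    return rows


hydro_rows = parse_hydrology(page)
print(hydro_rows)
```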
(2) The data processing module (i.e., the data server shown in fig. 1) is configured to, after a connection session between the server and the user is established, retrieve the hydrologic and water quality data acquired by the data acquisition module according to preset retrieval conditions, set analysis conditions based on the retrieval results, and output hydrologic and water quality analysis results.
It should be noted that retrieval according to the preset retrieval conditions includes:
In response to a request sent by a browser, the crawled data are searched by conditions such as station name and time, and preliminary calculations are performed on the data, such as daily, monthly, and yearly averages; alternatively, the crawled hydrological and water quality stations are matched against the national hydrological and water quality station information to obtain the geographic coordinates of the crawled stations.
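A retrieval by station name and time range followed by a monthly average could be sketched as below. The record layout, the field names, and the helper names `retrieve` and `monthly_mean` are assumptions made for illustration; the source does not specify how the server implements the query.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

FMT = "%Y-%m-%d %H:%M"

# Hypothetical crawled records.
records = [
    {"station": "Station A", "time": "2020-05-30 08:00", "water_level": 10.0},
    {"station": "Station A", "time": "2020-06-01 08:00", "water_level": 12.0},
    {"station": "Station A", "time": "2020-06-02 08:00", "water_level": 14.0},
    {"station": "Station B", "time": "2020-06-01 08:00", "water_level": 30.0},
]


def retrieve(rows, station, start, end):
    """Return the rows of one station whose time lies in [start, end]."""
    lo, hi = datetime.strptime(start, FMT), datetime.strptime(end, FMT)
    return [
        r for r in rows
        if r["station"] == station and lo <= datetime.strptime(r["time"], FMT) <= hi
    ]


def monthly_mean(rows, field):
    """Average a numeric field per calendar month (key 'YYYY-MM')."""
    groups = defaultdict(list)
    for r in rows:
        groups[r["time"][:7]].append(r[field])
    return {month: mean(values) for month, values in groups.items()}


hits = retrieve(records, "Station A", "2020-05-01 00:00", "2020-06-30 23:59")
print(monthly_mean(hits, "water_level"))
```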
In order to provide an intelligent query processing service, a data query module, a data analysis module, a data display module and a data export module are further arranged in the data processing module (refer to the schematic structural diagram of the data server in fig. 1); the result after the initial calculation based on the retrieval condition can be transmitted to other functional modules of the data processing module for query, analysis, display or export.
The functions of the four sub-modules of the data processing module are as follows:
The data analysis module reads the hydrological and water quality data crawled by the data acquisition module and automatically analyzes the water quality category and the exceedance concentration of each water quality index at each water quality site, as well as the year-on-year and period-on-period changes in flow and water level at each hydrological site;
Water quality analysis is based on the Environmental Quality Standards for Surface Water (GB3838-2002) and includes water quality category analysis, category proportion analysis, water quality index analysis, exceedance analysis of individual items, and exceedance category analysis; the water quality categories are Class I, II, III, IV, V, and inferior to Class V. Hydrological analysis sets a time range (month, year, etc.) for the monitoring data and computes the year-on-year and period-on-period changes in flow and water level.
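The two kinds of analysis can be illustrated with a short sketch. Everything here is an assumption for illustration: the monthly flow values are invented, the comparisons are taken to be relative changes against the same month of the previous year and against the previous month, and the ammonia nitrogen limits are quoted from memory of GB3838-2002 and should be checked against the standard before use.

```python
def growth_rate(current, base):
    """Relative change of current against base, as a fraction."""
    if base == 0:
        raise ValueError("base value must be non-zero")
    return (current - base) / base


# Hypothetical monthly mean flow keyed by (year, month).
monthly_flow = {(2019, 6): 1000.0, (2020, 5): 1100.0, (2020, 6): 1210.0}


def year_on_year(series, year, month):
    """Change against the same month of the previous year."""
    return growth_rate(series[(year, month)], series[(year - 1, month)])


def period_on_period(series, year, month):
    """Change against the immediately preceding month."""
    prev = (year, month - 1) if month > 1 else (year - 1, 12)
    return growth_rate(series[(year, month)], series[prev])


# Assumed upper limits (mg/L) for ammonia nitrogen, Class I to V; verify against GB3838-2002.
NH3N_LIMITS = [("I", 0.15), ("II", 0.5), ("III", 1.0), ("IV", 1.5), ("V", 2.0)]


def classify(value, limits):
    """Return the first class whose upper limit is not exceeded."""
    for name, limit in limits:
        if value <= limit:
            return name
    return "inferior V"


print(year_on_year(monthly_flow, 2020, 6), period_on_period(monthly_flow, 2020, 6))
print(classify(0.4, NH3N_LIMITS))
```

Indexes such as dissolved oxygen, where a higher value is better, would need the comparison reversed.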
The data query module and the data display module provide hydrological and water quality information services to target users, offer business staff functions such as downloading, querying, statistics, and analysis of hydrological and water quality data, and display the data visually;
It should be noted that, when analysis data are displayed, the distribution of hydrological and water quality sites can be shown on a map using GIS technology; the change over time of different indexes at a site, the change over time of water quality categories, and the monthly proportion of each water quality category can be shown in charts using ECharts (an open-source visualization library implemented in JavaScript). GIS and ECharts can also be combined: after the water quality category, exceedance concentration, and other indexes of the water quality sites are shown on the map, charts are added to the map to show the year-on-year and period-on-period changes in flow and water level at the hydrological sites, or the analysis results are shown as rankings.
The data export module provides several export services: exporting hydrological and water quality data by map selection or box selection; exporting hydrological and water quality data through table retrieval; or exporting the data to formats such as pdf, csv, txt, Excel, and Word files.
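As one illustration of the export service, the sketch below writes retrieved records to CSV text with the standard `csv` module. The function name `export_csv` and the record fields are hypothetical; the source does not describe how the other formats (pdf, Excel, Word) are produced.

```python
import csv
import io


def export_csv(rows, fields):
    """Serialize a list of dicts to CSV text, keeping only the given fields."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical retrieval result.
rows = [
    {"station": "Station A", "time": "2020-06-01 08:00", "flow": 1200.0, "note": "x"},
    {"station": "Station B", "time": "2020-06-01 08:00", "flow": 860.0},
]
csv_text = export_csv(rows, ["station", "time", "flow"])
print(csv_text)
```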
In this embodiment, a dynamic crawler tool adaptable to hydrological and water quality information publishing systems at different levels is formed using focused crawler technology, so that the information published by those systems is actively collected, automatically stored, and analyzed. Data query and analysis services are provided on the basis of the crawled data, which is of practical significance to hydrological and water quality management and operation departments.
Example 2:
Based on Example 1, in order to crawl the dynamically changing data of the hydrological and water quality websites accurately, the data acquisition module is further provided with a mailbox monitor module that works together with the web page monitor module (the two together form the online monitoring probe shown in FIG. 1), wherein:
When the web page monitor module determines that the target website has a data update, it sends crawl reminder information to the mailbox monitor module by email. The mailbox monitor module reads all mails in the mailbox, returns a mail list, and records the number of mails. When it determines that there are new mails, it reads the content of the new mails in the mail list and determines from the crawl reminder information whether the target website has a data update; if so, it drives the web crawler module to crawl the data, and if not, it continues analyzing the content of the new mails.
It should be noted that the web page monitor module determines whether the target website has a data update as follows:
When monitoring water quality data, the measurement time of the water quality data is located with XPath syntax, and when the measurement time of the monitored water quality data changes, crawl reminder information is sent to the mailbox monitor module by email;
When monitoring hydrological data, the whole page is monitored directly with the Distill Web Monitor plug-in, and whether the data have been updated is judged from the monitored data using a regular expression or another data comparison method; this method can also be used to judge updates of the water quality monitoring data.
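The update check based on measurement time can be sketched with a regular expression. This is only an illustration: the timestamp format `YYYY-MM-DD HH:MM` and the helper names are assumptions, and in the described system the comparison is performed by the Distill Web Monitor plug-in rather than by custom code.

```python
import re

# Assumed timestamp format on the monitored page.
TIME_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}")


def latest_measure_time(page_text):
    """Return the newest timestamp found in the page text, or None."""
    times = TIME_RE.findall(page_text)
    # Fixed-width timestamps in this format sort correctly as strings.
    return max(times) if times else None


def has_update(page_text, last_seen):
    """True when the page shows a measurement time newer than last_seen."""
    latest = latest_measure_time(page_text)
    return latest is not None and (last_seen is None or latest > last_seen)


old_page = "Station A 2020-06-30 08:00 pH 7.8"
new_page = "Station A 2020-06-30 12:00 pH 7.9"
print(has_update(new_page, latest_measure_time(old_page)))
```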
It should be noted that the mailbox monitor module can read mail through the following steps:
S101, logging in to the mailbox through the server function of the zmail module (a Python module for sending and receiving mail);
S102, reading all mails in the mailbox through the get_mails function, returning a mail list, and recording the number of mails.
In this embodiment, to improve efficiency, if the number of mails has increased by 1, the mail to read depends on the storage order of the mail list (first-in first-out or last-in first-out). For example, if the mail list is stored as a first-in first-out queue, the mail at index 0 of the mail list is read. The mail content is then analyzed with a regular expression (other data analysis methods are equally applicable to this step) in order to filter out junk mail and determine whether the target website data have been updated.
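The counting and filtering logic of the mailbox monitor can be sketched without a live mailbox. In this illustration the mail list is a plain list of dicts with a `subject` key, the newest mail is assumed to sit at index 0, and the alert pattern `data update` is hypothetical; in the described system the list would come from zmail's `get_mails` call.

```python
import re

# Hypothetical pattern identifying a crawl reminder; anything else is treated as junk.
ALERT_RE = re.compile(r"data update", re.IGNORECASE)


class MailboxMonitor:
    """Tracks the mail count and reports whether a new reminder arrived."""

    def __init__(self):
        self.seen_count = 0

    def check(self, mails):
        """mails: list of dicts, newest first (assumption). Returns True on a new reminder."""
        new_count = len(mails) - self.seen_count
        self.seen_count = len(mails)
        if new_count <= 0:
            return False
        new_mails = mails[:new_count]  # only inspect mails not seen before
        return any(ALERT_RE.search(m.get("subject", "")) for m in new_mails)


monitor = MailboxMonitor()
first = monitor.check([{"subject": "Weekly newsletter"}])
second = monitor.check([
    {"subject": "Distill alert: data update on target site"},
    {"subject": "Weekly newsletter"},
])
print(first, second)
```

When `check` returns True, the caller would trigger the web crawler module.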
It should be noted that, when it is determined from the crawl reminder information that the target website has a data update, the web crawler module can be driven to crawl the data as follows:
After receiving a data crawling command sent by the mailbox monitor module, the web crawler module crawls the corresponding website content.
It should be noted that, to improve resource utilization, the mailbox monitor module can be started periodically when implementing this embodiment: a timed task scheduling framework in Python is called to schedule the mailbox monitor module to receive and analyze the target mail at user-defined intervals.
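Periodic start-up can be illustrated with the standard-library `sched` module, used here as a stand-in for the timed task scheduling framework the text refers to. The function `run_periodically`, the task, and the very short interval are chosen only so the sketch finishes quickly.

```python
import sched
import time


def run_periodically(task, interval, times):
    """Run task `times` times, spaced `interval` seconds apart, then return."""
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    for i in range(times):
        # Each run is scheduled relative to now: interval, 2*interval, ...
        scheduler.enter(interval * (i + 1), 1, task)
    scheduler.run()  # blocks until every scheduled run has executed


checks = []


def check_mailbox():
    """Placeholder for the mailbox monitor's receive-and-analyze step."""
    checks.append(time.monotonic())


run_periodically(check_mailbox, interval=0.01, times=3)
print(len(checks))
```

A long-running deployment would use a much longer interval and an open-ended schedule instead of a fixed number of runs.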
In the hydrological water quality data acquisition system based on web crawler technology disclosed in this embodiment, the online monitoring probe is applied to the target to be crawled, so the dynamically changing data of the hydrological and water quality websites can be crawled promptly and accurately, ensuring that the crawled data are complete and accurate.
Example 3:
Because the amount of crawled data is large, data management inevitably becomes difficult; in this embodiment, a data storage module is therefore provided in the hydrological water quality data acquisition system (see FIG. 1);
When storing the data crawled by the data acquisition module, the data storage module can call the pymongo library in Python to store the hydrological and water quality data.
It should be noted that pymongo supports multiple independent databases, and a database can be obtained either by attribute (dot) access or by dictionary-style access. When hydrological and water quality data are stored through pymongo, they are stored as BSON. To classify and process the data, a site information establishing module is provided in the data storage module (see the database end in FIG. 1); it establishes, in BSON format, a national hydrological and water quality site table, a hydrological site table, a water quality site table, a hydrological data table, and a water quality data table, wherein:
the national hydrological water quality site information table is used for storing national hydrological water quality site names and geographic coordinates;
the hydrological site table is used for storing the crawled hydrological site names;
the water quality site table is used for storing the name of the crawled water quality site;
the hydrological data table is used for storing a plurality of hydrological elements obtained by crawling;
The water quality data table is used for storing a plurality of items of water quality factors obtained by crawling;
the hydrology and water quality data tables comprise station measuring time.
It should be noted that the hydrological elements include the drainage basin, administrative district, river name, station name, measurement time, water level, flow, warning water level, and the like; the water quality factors include the site, section name, measurement time, pH, dissolved oxygen, ammonia nitrogen, permanganate index, total organic carbon, water quality category, section attributes, site conditions, and the like.
To ensure that the crawled data are complete, non-duplicated, and accurate, the data acquisition module further comprises a deduplication module (the data deduplication and storage module shown in FIG. 1);
The deduplication module obtains the hydrological and water quality data tables from the site information establishing module and traverses the data; the traversal comprises the following steps:
Querying the site measurement time of the traversed data and judging whether it is the same as the measurement time last recorded for that site; specifically:
If the measurement times are the same, the data for that site are determined to have been crawled already;
If the measurement times differ, the data for that site are determined not to have been crawled yet. The currently traversed data are stored in the data storage module, and the hydrological or water quality site table is queried to check whether the current site is already recorded. If the site is recorded, the corresponding data table is identified and the crawled data are added to it; if the site is not recorded, the site name is added to the corresponding site table, a corresponding data table is created, and the crawled data are stored in the newly created data table.
It should be noted that, while traversing the crawled data, whether a site's data have already been crawled can be judged by querying the last measurement time recorded for the site and comparing it with the measurement time of the traversed data. If the site's data have not been crawled, the site is looked up in the hydrological site table or water quality site table kept by the site information establishing module: if the return value is not null, the site already exists and the crawled data are added directly to the corresponding hydrological or water quality data table; if the return value is null, the site is new, so the site information is first stored in the corresponding hydrological or water quality site table.
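The deduplication flow described above can be sketched with in-memory structures. This is an illustration only: a Python set stands in for the site table and a dict of lists for the per-site data tables, whereas the described system keeps these in MongoDB; the function name `store_if_new` and the record fields are hypothetical.

```python
def store_if_new(record, site_table, data_tables):
    """Store a crawled record unless its site already holds that measurement time.

    Returns True when the record was stored, False when it was a repeat.
    """
    station, measured = record["station"], record["time"]
    table = data_tables.get(station, [])
    if table and table[-1]["time"] == measured:
        return False  # same measurement time as last stored: already crawled
    if station not in site_table:
        site_table.add(station)  # new site: register it and create its data table
        data_tables[station] = table
    table.append(record)
    return True


site_table, data_tables = set(), {}
r1 = {"station": "Station A", "time": "2020-06-30 08:00", "flow": 1200.0}
r2 = {"station": "Station A", "time": "2020-06-30 12:00", "flow": 1250.0}
results = [store_if_new(r, site_table, data_tables) for r in (r1, r1, r2)]
print(results, len(data_tables["Station A"]))
```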
To prevent the data storage module from holding repeated data, a data cleaning module is also provided in the data storage module. Once started, the data cleaning module reads the data stored in the hydrological data table and the water quality data table and deletes duplicate records; alternatively, it traverses each data table and judges from the measurement time whether records are duplicated.
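A cleaning pass keyed on measurement time might look like the sketch below. The key `(station, time)` and the function name `clean_duplicates` are assumptions for illustration; the source states only that duplicates are judged by measurement time.

```python
def clean_duplicates(rows):
    """Keep the first record for each (station, measurement time) pair."""
    seen = set()
    kept = []
    for row in rows:
        key = (row["station"], row["time"])
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical data table containing one repeated record.
table = [
    {"station": "Station A", "time": "2020-06-30 08:00", "flow": 1200.0},
    {"station": "Station A", "time": "2020-06-30 08:00", "flow": 1200.0},
    {"station": "Station A", "time": "2020-06-30 12:00", "flow": 1250.0},
]
cleaned = clean_duplicates(table)
print(len(cleaned))
```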
It should be noted that, to improve resource utilization, the data cleaning module can be started at intervals through the apscheduler library, a Python timed task scheduling framework, when implementing this embodiment.
In this embodiment, a real-time deduplication algorithm is applied to the target to be crawled, which ensures that the data are non-duplicated and accurate; a Python timed task scheduling framework is used, which improves resource utilization; and using the data measurement time as the criterion for duplicates avoids both wrong deletions and missed deletions.
With reference to Examples 1 to 3, the invention discloses a hydrological water quality data acquisition method based on web crawler technology, comprising the following steps (see FIG. 2 for the specific implementation flow):
S1, under the data acquisition module, monitoring the water quality and hydrological data on the target website in real time through the web page monitoring and alert plug-in. During monitoring, web crawler technology is used: for the monitored water quality data, a JSON file is crawled for subsequent water quality analysis, and the encoded JSON string in the file is decoded into a Python object to obtain the water quality data; for the monitored hydrological data, an XML file containing the web page content is crawled, a node or node set in the XML file is selected with XPath syntax, and the web page content is parsed to obtain the hydrological data;
It should be noted that, in step S1, for the monitored water quality data, after the URL of the target website is analyzed with a packet capture tool, the web crawler module sends a POST request through the requests library in Python and obtains a JSON file for later water quality analysis; decoding into a Python object specifically includes:
Decoding the encoded JSON string in the JSON file into a Python object through the json library;
In step S1, for the monitored hydrological data, crawling the XML file containing the web page content specifically includes:
Using a Python + Selenium + Chrome crawler framework under the web crawler module to obtain the XML file containing the web page content.
And S2, after the connection session between the server and the user is established, searching the hydrologic and water quality data collected in the data collection module according to preset searching conditions under the data processing module, setting analysis conditions based on the searching results, and outputting hydrologic and water quality analysis results.
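The two parsing paths of step S1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented code: the sample payloads, the field names (`rows`, `site`, `measure_time`, `ph`, `water_level`) and the table layout are invented for the example, and the fetch itself (the POST request or the Selenium-driven browser) is left out. The standard-library `xml.etree.ElementTree` is used here, which supports only a subset of XPath; a full XPath engine such as lxml could be substituted.

```python
import json
import xml.etree.ElementTree as ET
from typing import Dict, List, Union

# Hypothetical response body of the water quality POST request.
SAMPLE_JSON = (
    '{"rows": [{"site": "Station A", '
    '"measure_time": "2020-06-30 08:00", "ph": 7.4}]}'
)

# Hypothetical page fragment holding hydrological readings.
SAMPLE_XML = """
<table>
  <tr><td>Station B</td><td>2020-06-30 08:00</td><td>12.35</td></tr>
  <tr><td>Station C</td><td>2020-06-30 08:00</td><td>8.10</td></tr>
</table>
"""


def parse_water_quality(payload: str) -> List[dict]:
    """Decode the JSON string into Python objects and return the rows."""
    return json.loads(payload)["rows"]


def parse_hydrology(page_xml: str) -> List[Dict[str, Union[str, float]]]:
    """Select every <tr> node with an XPath expression and read its cells."""
    root = ET.fromstring(page_xml.strip())
    records = []
    for row in root.findall(".//tr"):
        site, measure_time, level = (cell.text for cell in row.findall("td"))
        records.append(
            {"site": site, "measure_time": measure_time,
             "water_level": float(level)}
        )
    return records
```

Both functions return plain dictionaries, so the downstream storage and deduplication steps can treat water quality and hydrological records in the same way.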
In combination with an online monitoring probe, crawling the hydrological and water quality data comprises the following substeps:
S11. When the Distill Web Monitor plug-in is used to monitor hydrological and water quality data and the target website has a data update, a crawl reminder is sent to the mailbox monitor module by email; for water quality data, the measurement time is checked through XPath syntax, and a change in the monitored measurement time triggers the crawl reminder email to the mailbox monitor module;
S12. The mailbox monitor module reads the mail content and, when the crawl reminder indicates that the target website has a data update, drives the web crawler module to crawl the data;
S13. In combination with the deduplication module, to ensure that the data are non-repeating and accurate, a data deduplication step is performed when the crawled data is read from the data storage module, specifically:
obtaining the hydrological and water quality data tables from the site information establishment module through the deduplication module, traversing the data, and determining during the traversal whether the data for the corresponding site have already been crawled;
if the data for the corresponding site have not been crawled, storing the traversed data in the data storage module;
also in that case, querying the hydrological and water quality site tables to judge whether the currently traversed site is recorded in the corresponding site table: for a site recorded in the site table, the corresponding data table is determined and the crawled data are added to it; for a site not recorded in the site table, the site name is added to the corresponding site table, a corresponding data table is created, and the crawled data are stored in the newly created data table.
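The deduplication and storage routing of substep S13 can be sketched as follows. This is a simplified illustration, not the patented implementation: an in-memory class stands in for the MongoDB-backed data storage module, the class and method names are hypothetical, and "already crawled" is assumed to mean that the incoming record has the same measurement time as the latest stored record for that site.

```python
from typing import Dict, List, Optional

Record = Dict[str, str]


class InMemoryStore:
    """Hypothetical stand-in for the site table and per-site data tables."""

    def __init__(self) -> None:
        self.site_table: List[str] = []
        self.data_tables: Dict[str, List[Record]] = {}

    def last_measure_time(self, site: str) -> Optional[str]:
        """Return the measurement time of the newest stored record, if any."""
        rows = self.data_tables.get(site)
        return rows[-1]["measure_time"] if rows else None

    def save_if_new(self, record: Record) -> bool:
        """Store the record unless it repeats the site's latest measurement.

        Returns True when the record was stored and False when it was
        skipped as already crawled.
        """
        site = record["site"]
        if self.last_measure_time(site) == record["measure_time"]:
            return False  # same measurement time: already crawled
        if site not in self.site_table:
            # Unknown site: register it and create its data table first.
            self.site_table.append(site)
            self.data_tables[site] = []
        self.data_tables[site].append(record)
        return True
```

Comparing only against the latest record keeps the check cheap, at the cost of assuming that a site's records arrive in time order.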
The invention discloses a method and a system for acquiring hydrological water quality data based on web crawler technology. The disclosed technical solution provides data services to external users and supports analysis and display based on the system's data, so as to serve commercial applications of the system's related tools.
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. A hydrological water quality data acquisition system based on web crawler technology, characterized by comprising the following modules:
a data acquisition module, configured to monitor water quality and hydrological data on a target website in real time through a web page monitoring and reminder plug-in; wherein, during monitoring, web crawler technology is used: for the monitored water quality data, a JSON file for subsequent water quality data parsing is obtained, and the encoded JSON string in the JSON file is decoded into a Python object to obtain the water quality data; for the monitored hydrological data, an XML file containing the web page content is crawled, a node or node set in the XML file is selected through XPath syntax, and the web page content is parsed to obtain the hydrological data;
and a data processing module, configured to, after a connection session between the server and the user is established, retrieve the hydrological and water quality data collected by the data acquisition module according to preset retrieval conditions, set analysis conditions based on the retrieval results, and output hydrological and water quality analysis results.
2. The hydrological water quality data acquisition system of claim 1, wherein the data acquisition module comprises a web crawler module; wherein:
for the monitored water quality data, after the URL of the target website is analyzed with a packet capture analysis tool, the web crawler module sends a POST request through the requests library in Python to obtain a JSON file for later water quality data parsing, and the encoded JSON string in the JSON file is then decoded into a Python object through the json library to obtain the water quality data;
for the monitored hydrological data, the web crawler module uses the Python + Selenium + Chrome crawler framework to obtain an XML file containing the web page content.
3. The hydrological water quality data acquisition system of claim 2, wherein the data acquisition module comprises a web page monitor module and a mailbox monitor module, wherein:
the web page monitor module is configured to monitor the real-time data of the target website at intervals using a Distill Web Monitor plug-in installed in a browser, and to send a crawl reminder to the mailbox monitor module by email when the target website has a data update; monitoring the water quality data comprises checking the measurement time of the water quality data through XPath syntax and sending the crawl reminder when the monitored measurement time changes;
the mailbox monitor module is configured to read all mails in the mailbox and return a mail list, to record the number of mails on each read, to read the content of the newly added mails in the mail list when newly added mails are found, and to drive the web crawler module to crawl the data when the crawl reminder indicates that the target website has a data update.
4. The hydrological water quality data acquisition system of claim 3, wherein a timed task scheduling framework in Python is called to start the mailbox monitor module at intervals;
in the mailbox monitor module, mails are stored in the mail list as a first-in, first-out queue; if the number of newly added mails is 1, the mail whose index in the mail list is 0 is read, and the mail content is parsed with a regular expression when it is read.
5. The hydrological water quality data acquisition system of claim 1, further comprising a data storage module;
the data storage module calls the pymongo library in Python to store the hydrological and water quality data.
6. The hydrological water quality data acquisition system of claim 5, wherein the data storage module comprises a site information establishment module;
the site information establishment module is configured to establish, in BSON (binary JSON) format, a national hydrological and water quality site table, a hydrological site table, a water quality site table, a hydrological data table and a water quality data table; wherein:
the national hydrological and water quality site table is used for storing the names and geographic coordinates of national hydrological and water quality sites;
the hydrological site table is used for storing the names of the crawled hydrological sites;
the water quality site table is used for storing the names of the crawled water quality sites;
the hydrological data table is used for storing the hydrological elements obtained by crawling;
the water quality data table is used for storing the water quality factors obtained by crawling;
the hydrological and water quality data tables each include the site measurement time.
7. The hydrological water quality data acquisition system of claim 6, wherein the data acquisition module further comprises a deduplication module;
the deduplication module is configured to traverse the data after obtaining the hydrological and water quality data tables from the site information establishment module, the data traversal comprising:
querying the site query time of the traversed data, and judging whether the site query time in the current traversal is the same as that in the previous traversal; specifically:
if they are the same, determining that the data for the corresponding site have been crawled;
if they are not the same, determining that the data for the corresponding site have not been crawled; in that case, on the one hand, storing the currently traversed data in the data storage module, and on the other hand, querying the hydrological and water quality site tables to judge whether the currently traversed site is recorded in the corresponding site table: for a site recorded in the site table, the corresponding data table is determined and the crawled data are added to it; for a site not recorded in the site table, the site name is added to the corresponding site table, a corresponding data table is created, and the crawled data are then stored.
8. A hydrological water quality data acquisition method using the hydrological water quality data acquisition system based on web crawler technology according to any one of claims 1-7, characterized by comprising the following steps:
S1. In the data acquisition module, monitoring the water quality and hydrological data on a target website in real time through a web page monitoring and reminder plug-in; wherein, during monitoring, web crawler technology is used: for the monitored water quality data, a JSON file for subsequent water quality data parsing is obtained, and the encoded JSON string in the JSON file is decoded into a Python object to obtain the water quality data; for the monitored hydrological data, an XML file containing the web page content is crawled, a node or node set in the XML file is selected through XPath syntax, and the web page content is parsed to obtain the hydrological data;
S2. After a connection session between the server and the user is established, retrieving, in the data processing module, the hydrological and water quality data collected by the data acquisition module according to preset retrieval conditions, setting analysis conditions based on the retrieval results, and outputting hydrological and water quality analysis results.
9. The hydrological water quality data acquisition method according to claim 8, wherein in step S1, for the monitored water quality data, after the URL of the target website is analyzed with a packet capture analysis tool, the web crawler module sends a POST request through the requests library in Python, and after the JSON file for later water quality data parsing is obtained, the decoding into a Python object specifically comprises:
decoding the encoded JSON string in the JSON file into a Python object through the json library;
in step S1, for the monitored hydrological data, crawling the XML file containing the web page content specifically comprises:
obtaining the XML file containing the web page content through the Python + Selenium + Chrome crawler framework in the web crawler module.
10. The hydrological water quality data acquisition method of claim 8, wherein in step S1, a Distill Web Monitor plug-in installed in a browser is used to monitor the real-time data of the target website at intervals;
crawling the hydrological and water quality data comprises the following substeps:
S11. When the Distill Web Monitor plug-in is used to monitor hydrological and water quality data and the target website has a data update, a crawl reminder is sent to the mailbox monitor module by email; for water quality data, the measurement time is checked through XPath syntax, and a change in the monitored measurement time triggers the crawl reminder email to the mailbox monitor module;
S12. The mailbox monitor module reads the mail content and, when the crawl reminder indicates that the target website has a data update, drives the web crawler module to crawl the data;
S13. When the crawled data is read from the data storage module, a data deduplication step is performed, specifically:
obtaining the hydrological and water quality data tables from the site information establishment module through the deduplication module, traversing the data, and determining during the traversal whether the data for the corresponding site have already been crawled;
if the data for the corresponding site have not been crawled, storing the traversed data in the data storage module;
also in that case, querying the hydrological and water quality site tables to judge whether the currently traversed site is recorded in the corresponding site table: for a site recorded in the site table, the corresponding data table is determined and the crawled data are added to it; for a site not recorded in the site table, the site name is added to the corresponding site table, a corresponding data table is created, and the crawled data are then stored.
CN202010613400.5A 2020-06-30 2020-06-30 Hydrological water quality data acquisition method and system based on web crawler technology Pending CN111859067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010613400.5A CN111859067A (en) 2020-06-30 2020-06-30 Hydrological water quality data acquisition method and system based on web crawler technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010613400.5A CN111859067A (en) 2020-06-30 2020-06-30 Hydrological water quality data acquisition method and system based on web crawler technology

Publications (1)

Publication Number Publication Date
CN111859067A true CN111859067A (en) 2020-10-30

Family

ID=72988966

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010613400.5A Pending CN111859067A (en) 2020-06-30 2020-06-30 Hydrological water quality data acquisition method and system based on web crawler technology

Country Status (1)

Country Link
CN (1) CN111859067A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102799686A (en) * 2012-07-30 2012-11-28 河海大学 Water resource information vertical search method based on cloud platform
CN105738587A (en) * 2016-01-12 2016-07-06 山东科技大学 Water quality monitoring system
CN107391651A (en) * 2017-07-17 2017-11-24 河海大学 Water conservancy information retrieval system and method based on web crawlers

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
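姚良: "Python 3 Crawler in Practice: Data Cleaning, Data Analysis and Visualization", 31 October 2019, China Railway Publishing House, pages 15-17 *
李杰: "Design and Implementation of a Directed Acquisition System for Internet Meteorological and Hydrological Data", China Master's Theses Full-text Database, Basic Sciences, no. 02, pages 2-4 *
汤景泰: "Crisis Communication Management", vol. 1, 31 March 2015, Economic Daily Press, pages 124-129 *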

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116304273A (en) * 2023-05-24 2023-06-23 中交第四航务工程勘察设计院有限公司 Management method of hydrological data display platform based on web crawler technology
CN116304273B (en) * 2023-05-24 2023-08-18 中交第四航务工程勘察设计院有限公司 Management method of hydrological data display platform based on web crawler technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination