CN110347895A - Ecological space data crawling method based on Web - Google Patents

Info

Publication number
CN110347895A
Authority
CN
China
Prior art keywords
data
url
sulfur dioxide
web
crawled
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910498905.9A
Other languages
Chinese (zh)
Inventor
白云
李川
刘岱
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial And Commercial University Of Chongqing School Of Wisdom
Original Assignee
Industrial And Commercial University Of Chongqing School Of Wisdom
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial And Commercial University Of Chongqing School Of Wisdom filed Critical Industrial And Commercial University Of Chongqing School Of Wisdom
Priority to CN201910498905.9A priority Critical patent/CN110347895A/en
Publication of CN110347895A publication Critical patent/CN110347895A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a Web-based ecological space data crawling method, belonging to the Internet field, characterized in that the method comprises the following steps: (1) sulfur dioxide data crawling: a crawling scheme is selected according to the characteristics of sulfur dioxide concentration data, the data are crawled, and cities where the sulfur dioxide concentration exceeds the standard are screened out; (2) indoor sulfur dioxide detection: a sulfur dioxide detector is used to measure the indoor sulfur dioxide concentration in those cities; (3) data comparison: the measured indoor sulfur dioxide concentration is compared with the crawled data; (4) sulfur dioxide desulfurization treatment: if the indoor sulfur dioxide concentration shows a gradually rising trend and approaches the crawled sulfur dioxide concentration, desulfurization treatment is started. With the invention, data crawling is used to screen out cities where the sulfur dioxide concentration exceeds the standard, indoor sulfur dioxide concentration is then measured in a targeted manner, and a corresponding indoor sulfur dioxide treatment scheme is provided.

Description

Ecological space data crawling method based on Web
Technical field
The invention belongs to the Internet field, and in particular relates to a Web-based ecological space data crawling method.
Background art
Web spatial data collection mainly relies on web crawler technology. A web crawler, also known as a web spider or web robot, is a program or script that automatically fetches web information according to certain rules. A traditional web crawler starts from the URLs of one or several initial pages and obtains the URLs on those pages; while fetching pages, it continually extracts new URLs from the current page and puts them into a queue until a stopping condition of the system is met.
At present, air quality monitoring data are updated quickly and are large in volume. Traditional research on crawler-based web spatial data collection is mostly based on single-machine web crawlers. However, web spatial data are widely distributed across different websites and are updated frequently, so a single-machine crawler can hardly meet demand in terms of crawl coverage and crawl efficiency, and it is difficult to guarantee the timeliness and completeness of the collected data. In addition, to improve crawl efficiency, single-machine crawlers are usually implemented with asynchronous multithreading, which is difficult to implement, hard to maintain, and prone to deadlock.
Among the air quality data collected by web spatial data crawling, sulfur dioxide is one of the main atmospheric pollutants and one of the causes of acid rain. If the sulfur dioxide concentration in the air exceeds the standard, it is harmful to the human body: it readily irritates the eyes and nasal mucosa and can even cause laryngeal and bronchial spasm, leading to coma in mild cases and death in severe cases. If the sulfur dioxide concentration in the environment exceeds the standard, plants also show symptoms of "poisoning": leaves gradually fade and wilt and veins turn white, which eventually leads to death. After sulfur dioxide in the air dissolves in water, it acidifies soil and water and causes great harm to humans and plants; sulfur dioxide and the resulting acid rain also seriously hinder social and economic development.
Sulfur dioxide pollution in China is serious. The main reason is that China's industrial structure is dominated by coal-based industry, and coal burning discharges large amounts of sulfur dioxide and other pollutants; at the same time, sulfur dioxide pollution prevention and control is inadequate, and insufficient attention is paid to protecting the ecological environment.
Current sulfur dioxide treatment methods include physical methods and chemical methods. The widely used physical methods are adsorption and solvent absorption; the chemical methods mainly include the calcium method, the sodium method and the potassium method. Desulfurization by a physical method alone generally requires adding a large amount of chemical substances to treat the sulfur dioxide, and wastes considerable energy. With chemical methods, because of the reaction conditions, the treated solution often contains unreacted solution, which makes separation difficult, hinders reuse, and wastes resources.
Summary of the invention
The present invention provides a Web-based ecological space data crawling method, at least to solve the problems in the related art that air quality data collection yields an excessive volume of data, too much useless data, and low data quality. It also analyzes the sulfur dioxide in the air quality data, identifies the factors that influence sulfur dioxide content, and provides a corresponding indoor sulfur dioxide treatment scheme.
The Web-based ecological space data crawling method comprises the following steps:
(1) Sulfur dioxide data crawling: a crawling scheme is selected according to the characteristics of sulfur dioxide concentration data, the data are crawled, and cities where the sulfur dioxide concentration exceeds the standard are screened out;
(2) Indoor sulfur dioxide detection: a sulfur dioxide detector is used to measure the indoor sulfur dioxide concentration in the cities where the sulfur dioxide concentration exceeds the standard;
(3) Data comparison: the measured indoor sulfur dioxide concentration is compared with the crawled data;
(4) Sulfur dioxide desulfurization treatment: if the indoor sulfur dioxide concentration shows a gradually rising trend and approaches the crawled sulfur dioxide concentration, desulfurization treatment is started.
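As an illustration of the rule in step (4), the following minimal Python sketch turns the two conditions into a check. The patent does not define "gradually rising" or "approaches" numerically, so the window size `window`, the relative tolerance `tolerance`, and the function names are assumptions made here for illustration only.

```python
from typing import Sequence


def is_rising(readings: Sequence[float], window: int = 3) -> bool:
    """Return True if the last `window` readings are strictly increasing.

    Assumption: "gradually rising" is read as strictly increasing samples.
    """
    recent = readings[-window:]
    if len(recent) < window:
        return False  # not enough samples to judge a trend
    return all(a < b for a, b in zip(recent, recent[1:]))


def should_start_desulfurization(indoor: Sequence[float], crawled: float,
                                 tolerance: float = 0.1, window: int = 3) -> bool:
    """Apply step (4): rising indoor trend AND latest value near the crawled value.

    Assumption: "near" means the latest indoor reading has reached at least
    (1 - tolerance) of the crawled outdoor concentration.
    """
    if not indoor:
        return False
    close = indoor[-1] >= crawled * (1.0 - tolerance)
    return is_rising(indoor, window) and close
```

For instance, with indoor readings `[0.10, 0.20, 0.28]` and a crawled value of `0.30` (same unit), both conditions hold under the default settings; a falling series or a series that stays far below the crawled value does not trigger treatment.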
Further, the specific steps of sulfur dioxide data crawling in step (1) are as follows:
1.1 Select a crawling scheme: sulfur dioxide concentration data are voluminous and updated quickly. Based on these characteristics, the HTML pages of environmental monitoring websites are analyzed to locate the URLs and tags of the required data, and, according to the tags and URL information, a distributed web crawler is chosen as the page information crawling scheme;
1.2 Crawl data: extract relevant URLs from web pages, add them to the URL queue, and crawl the data on the websites;
1.3 URL processing: read URLs, deduplicate URLs, extract domain names, and store URLs;
1.4 Clean data: the crawled data undergo data cleaning, which consists of consistency checking and the handling of invalid and missing values. Consistency checking examines the data according to each variable's reasonable value range and the relationships between variables, and finds values that are out of the normal range or mutually contradictory. Invalid and missing values are handled by estimation, direct rejection, global-variable filling, or random interpolation;
1.5 Store data: store the crawled data in a database.
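The cleaning in step 1.4 can be sketched roughly as follows. This is a simplified illustration, not the patent's implementation: the valid range (`low`, `high`), the strategy names, and the use of the mean as the "estimate" are assumptions, and random interpolation is left out.

```python
from statistics import mean
from typing import List, Optional


def check_consistency(values: List[Optional[float]],
                      low: float = 0.0, high: float = 2000.0) -> List[Optional[float]]:
    """Mark readings outside the assumed plausible range [low, high] as missing."""
    return [v if v is not None and low <= v <= high else None for v in values]


def handle_missing(values: List[Optional[float]], strategy: str = "estimate",
                   fill_value: float = 0.0) -> List[float]:
    """Handle missing values with one of three of the listed approaches."""
    valid = [v for v in values if v is not None]
    if strategy == "reject":      # direct rejection: drop missing entries
        return valid
    if strategy == "constant":    # global-variable filling with a fixed value
        return [fill_value if v is None else v for v in values]
    if strategy == "estimate":    # estimation: here, the mean of valid readings
        if not valid:
            raise ValueError("no valid readings to estimate from")
        estimate = mean(valid)
        return [estimate if v is None else v for v in values]
    raise ValueError(f"unknown strategy: {strategy}")
```

A typical use would be `handle_missing(check_consistency(raw))`, so that out-of-range readings are first marked as missing and then filled or dropped in one pass.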
Further, the specific steps of crawling data are as follows:
2.1 The crawler engine opens the main domain of an environmental monitoring website, finds the parser that handles that website, and obtains from the parser the start URLs to be crawled first;
2.2 The crawler engine sends the start URLs to the scheduler, and the scheduler adds them to the shared to-be-crawled URL queue stored on the Redis cache server of the master node;
2.3 The crawler engine asks the scheduler for the remaining URLs in the shared to-be-crawled URL queue;
2.4 The scheduler looks up and returns the first URL in the shared to-be-crawled URL queue, and the crawler engine then sends the network request for that URL to the downloader through the downloader middleware;
2.5 The downloader downloads the web page corresponding to the URL, and then passes the downloaded data on sulfur dioxide content in the air to the crawler engine through the downloader middleware;
2.6 The crawler engine passes the downloaded data to the parser through the crawler middleware;
2.7 The parser analyzes and processes the downloaded data, extracts the data items of interest and new URLs, and sends them to the crawler engine;
2.8 Crawled web path supplementation: when a page is a referenced page, the recorded path may be incomplete and the collected data may have gaps; to avoid this, the path of the crawled web pages is supplemented.
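The engine, scheduler, downloader and parser loop of steps 2.1 to 2.7 can be sketched in a few lines of Python. This is a single-process illustration only: an in-memory `deque` stands in for the shared Redis queue, the middleware layers are omitted, and `download`, `parse`, `FAKE_SITE` and the sample values are hypothetical names and data introduced here.

```python
from collections import deque
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical signatures: download(url) returns page content or None on failure;
# parse(url, page) returns (data items, newly discovered URLs).
Download = Callable[[str], Optional[Any]]
Parse = Callable[[str, Any], Tuple[List[Dict[str, Any]], List[str]]]


def crawl(start_urls: List[str], download: Download, parse: Parse,
          max_pages: int = 100) -> List[Dict[str, Any]]:
    unique_starts = list(dict.fromkeys(start_urls))
    queue = deque(unique_starts)   # stand-in for the shared URL queue (step 2.2)
    seen = set(unique_starts)      # URLs already queued, to avoid repeats
    items: List[Dict[str, Any]] = []
    fetched = 0
    while queue and fetched < max_pages:        # step 2.3: any URLs left?
        url = queue.popleft()                   # step 2.4: take the first URL
        page = download(url)                    # step 2.5: fetch the page
        fetched += 1
        if page is None:
            continue
        new_items, new_urls = parse(url, page)  # steps 2.6 and 2.7
        items.extend(new_items)
        for new_url in new_urls:
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return items


# In-memory stand-in for a monitoring site: url -> (SO2 reading or None, links).
FAKE_SITE = {
    "site/index": (None, ["site/city-a", "site/city-b"]),
    "site/city-a": (35.0, ["site/index"]),
    "site/city-b": (62.0, []),
}


def fake_download(url: str) -> Optional[Any]:
    return FAKE_SITE.get(url)


def fake_parse(url: str, page: Any) -> Tuple[List[Dict[str, Any]], List[str]]:
    so2, links = page
    found = [] if so2 is None else [{"url": url, "so2": so2}]
    return found, links
```

Calling `crawl(["site/index"], fake_download, fake_parse)` visits the index page first and then the two city pages in queue order. In a distributed deployment the queue and the seen-set would live on the master node rather than in local memory.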
The method for judging whether supplementation is needed is as follows: for two consecutively requested pages P1 and P2, if P1 is the page referencing P2, the path between the two pages needs to be completed; if it is not, the user access path must be checked for a page that accesses P2. If there is none, P2 is judged to belong to a new user session and no path completion is needed; if there is one, the user performed a back operation and reached P2 through P1, and the path needs to be completed.
Path completion is generally done by matching parent nodes. When it is judged that the path between two pages needs to be supplemented, the parent page of P2 is checked first and matched against the parent node of P1. If they are the same, the parent node of P1 can be used directly as the complete path between P1 and P2. If they differ, the grandparent node of P1 is checked next and matched against the parent node of P2, and so on, until all nodes needed to reach the matching parent node of P2 have been added to the user access path.
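The parent-matching procedure can be sketched as follows, under the assumption that the decision to complete the path has already been made and that each page's parent (referring) page is known. The `parent` mapping, the function name and the page names are hypothetical, and this is one reading of the description rather than the patent's exact algorithm.

```python
from typing import Dict, List, Optional


def pages_to_insert(p1: str, p2: str, parent: Dict[str, str]) -> Optional[List[str]]:
    """Walk up from P1 through its ancestors until the parent of P2 is reached.

    Returns the ancestors to insert between P1 and P2 in the access path,
    or None if P2's parent is unknown or is not an ancestor of P1.
    """
    target = parent.get(p2)          # parent page of P2
    if target is None:
        return None
    inserted: List[str] = []
    visited = {p1}                   # guard against cycles in the mapping
    node = parent.get(p1)            # start with the parent of P1
    while node is not None and node not in visited:
        inserted.append(node)
        if node == target:           # match found: path is complete
            return inserted
        visited.add(node)
        node = parent.get(node)      # continue with the grandparent, and so on
    return None


# Hypothetical site structure: child page -> parent page.
PARENT = {"list": "index", "city-a": "list", "city-b": "list", "about": "index"}
```

With this mapping, moving from `city-a` to `city-b` requires inserting only `list`, while moving from `city-a` to `about` requires inserting `list` and then `index`. The completed segment of the access path is then P1, the inserted pages, and P2.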
Further, URL processing includes the following steps:
3.1 Read URLs: read a batch of URLs from the URL queue of the Redis database onto the Storm distributed platform for processing;
3.2 URL deduplication: filter out URLs that have already been crawled, so that the web crawler does not crawl the same URL repeatedly, which improves the crawl efficiency of the crawler system;
3.3 Extract domain names: extract the domain name from each URL and, based on the characteristics of website domain names, identify the website URL queue to which the URL belongs;
3.4 URL storage: store URLs into different website URL queues according to domain name; URL storage is implemented with TridentState.
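Steps 3.1 to 3.4 can be illustrated with a small single-process sketch. It uses only the Python standard library: a `set` and per-domain `deque` objects stand in for Redis, Storm and TridentState, which are not used here, and the function name and sample URLs are assumptions.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, Iterable, Set
from urllib.parse import urlsplit


def process_url_batch(batch: Iterable[str], seen: Set[str],
                      site_queues: DefaultDict[str, Deque[str]]) -> None:
    """Deduplicate a batch of URLs and file each one under its domain's queue."""
    for url in batch:                      # step 3.1: a batch read from the queue
        if url in seen:                    # step 3.2: skip URLs already handled
            continue
        domain = urlsplit(url).hostname    # step 3.3: extract the domain name
        if domain is None:                 # not an absolute URL; ignore it
            continue
        seen.add(url)
        site_queues[domain].append(url)    # step 3.4: store by domain


seen_urls: Set[str] = set()
queues: DefaultDict[str, Deque[str]] = defaultdict(deque)
process_url_batch(
    ["http://a.example/x", "http://a.example/x",
     "http://b.example/y", "http://a.example/z"],
    seen_urls, queues,
)
```

After this call, `queues` holds two URLs for `a.example` and one for `b.example`; the repeated URL is stored once. An in-memory set does not scale to a distributed crawler, which is presumably why the patent keeps this state in Redis and TridentState.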
Further, the data crawled are sulfur dioxide concentrations of major northern heavy-industry cities, such as those with thermal power plants and chemical industry.
Further, the sulfur dioxide treatment scheme is membrane absorption: desulfurization is achieved with a graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane absorber, with sodium hydroxide solution as the absorbing liquid.
Further, the graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane is prepared as follows:
4.1 Perfluoroethylene-propylene, a pore-forming agent composed of nano-scale silica and an interface treating agent, and the plasticizer dioctyl phthalate are mixed uniformly; under vacuum conditions, graphene oxide is added and thoroughly mixed; after drying, melt drawing is carried out;
4.2 The drawn material is subjected to electrospinning and then extracted, washed and soaked with absolute ethanol to obtain the graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane.
Further, the mass ratio of perfluoroethylene-propylene : pore-forming agent : plasticizer is 3:2:1, the mass of graphene oxide is 0.2% of the total mass of the mixture of the three, the vacuum drying temperature is 98 °C, and the drying time is 10 h.
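As a worked illustration of the stated proportions, the short Python sketch below computes component masses for a given batch. The 3:2:1 ratio and the 0.2% graphene oxide fraction come from the text; the batch size, the function name and the dictionary keys are assumptions chosen for the example.

```python
from typing import Dict


def batch_masses(mixture_mass_g: float) -> Dict[str, float]:
    """Split a three-component mixture by the 3:2:1 mass ratio and add 0.2% graphene oxide."""
    parts = {"perfluoroethylene_propylene": 3, "pore_forming_agent": 2, "plasticizer": 1}
    total_parts = sum(parts.values())
    masses = {name: mixture_mass_g * share / total_parts for name, share in parts.items()}
    # Graphene oxide is 0.2% of the total mass of the three mixed components.
    masses["graphene_oxide"] = mixture_mass_g * 0.002
    return masses
```

For a hypothetical 600 g mixture this gives 300 g of perfluoroethylene-propylene, 200 g of pore-forming agent, 100 g of plasticizer, and about 1.2 g of graphene oxide.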
Further, electrospinning is carried out at a voltage of 25 kV and an injection rate of 2.0 mL/h.
Beneficial effects
(1) The present invention uses distributed web crawling. The master node does not crawl data and only schedules crawl tasks, while the crawler nodes download and extract data; deployment is simple and the system is easy to scale. Resumable crawling is supported: after a fault is repaired the system can run again and quickly restore the previous data structures, which improves system stability. The master node load-balances the crawler nodes of the distributed crawler system, preventing any node from being overloaded or idle, so that the workload of each crawler node is roughly the same and crawl efficiency improves. Periodic updates are possible, which improves resource utilization.
(2) The present invention adds a path supplementation step to data crawling. Path supplementation completes the full access path to a visited page, which improves the completeness of crawled data, avoids data gaps, and improves the accuracy of crawling.
(3) The present invention cleans data before storing them, which reduces abnormal or missing data and makes the collected data more accurate; the cleaned data also take up less storage space, which improves storage efficiency.
(4) The present invention uses membrane absorption: the gas-phase sulfur dioxide and the sodium hydroxide solution are separated by a porous membrane, and sulfur dioxide gas passes through the pores of the membrane to the gas-liquid interface. The reaction of sulfur dioxide with sodium hydroxide not only achieves desulfurization but also allows sulfur resources to be recovered.
(5) The present invention uses a graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane, which increases the hydrophilicity, electrical conductivity and tensile strength of the membrane material, so that the modified hollow fiber membrane performs sulfur dioxide desulfurization better.
Brief description of the drawings
Fig. 1 is a flow diagram of spatial data crawling;
Fig. 2 is a schematic diagram and flow chart of the membrane absorber device for absorbing sulfur dioxide;
Fig. 3 is a bar chart of sulfur dioxide concentration before and after absorption.
Specific embodiments
The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Embodiment 1
Fig. 1 is a flow diagram of the spatial data crawling provided in this embodiment. As shown in Fig. 1, the Web-based ecological space data crawling method comprises the following steps:
(1) Sulfur dioxide data crawling: a crawling scheme is selected according to the characteristics of sulfur dioxide concentration data, the data are crawled, and cities where the sulfur dioxide concentration exceeds the standard are screened out;
(2) Indoor sulfur dioxide detection: a sulfur dioxide detector is used to measure the indoor sulfur dioxide concentration in the cities where the sulfur dioxide concentration exceeds the standard;
(3) Data comparison: the measured indoor sulfur dioxide concentration is compared with the crawled data;
(4) Sulfur dioxide desulfurization treatment: if the indoor sulfur dioxide concentration shows a gradually rising trend and approaches the crawled sulfur dioxide concentration, desulfurization treatment is started.
Further, the specific steps of sulfur dioxide data crawling in step (1) are as follows:
1.1 Select a crawling scheme: sulfur dioxide concentration data are voluminous and updated quickly. Based on these characteristics, the HTML pages of environmental monitoring websites are analyzed to locate the URLs and tags of the required data, and, according to the tags and URL information, a distributed web crawler is chosen as the page information crawling scheme;
1.2 Crawl data: extract relevant URLs from web pages, add them to the URL queue, and crawl the data on the websites;
1.3 URL processing: read URLs, deduplicate URLs, extract domain names, and store URLs;
1.4 Clean data: the crawled data undergo data cleaning, which consists of consistency checking and the handling of invalid and missing values. Consistency checking examines the data according to each variable's reasonable value range and the relationships between variables, and finds values that are out of the normal range or mutually contradictory. Invalid and missing values are handled by estimation, direct rejection, global-variable filling, or random interpolation;
1.5 Store data: store the crawled data in a database.
Further, the specific steps of crawling data are as follows:
2.1 The crawler engine opens the main domain of an environmental monitoring website, finds the parser that handles that website, and obtains from the parser the start URLs to be crawled first;
2.2 The crawler engine sends the start URLs to the scheduler, and the scheduler adds them to the shared to-be-crawled URL queue stored on the Redis cache server of the master node;
2.3 The crawler engine asks the scheduler for the remaining URLs in the shared to-be-crawled URL queue;
2.4 The scheduler looks up and returns the first URL in the shared to-be-crawled URL queue, and the crawler engine then sends the network request for that URL to the downloader through the downloader middleware;
2.5 The downloader downloads the web page corresponding to the URL, and then passes the downloaded data on sulfur dioxide content in the air to the crawler engine through the downloader middleware;
2.6 The crawler engine passes the downloaded data to the parser through the crawler middleware;
2.7 The parser analyzes and processes the downloaded data, extracts the data items of interest and new URLs, and sends them to the crawler engine;
2.8 Crawled web path supplementation: when a page is a referenced page, the recorded path may be incomplete and the collected data may have gaps; to avoid this, the path of the crawled web pages is supplemented.
The method for judging whether supplementation is needed is as follows: for two consecutively requested pages P1 and P2, if P1 is the page referencing P2, the path between the two pages needs to be completed; if it is not, the user access path must be checked for a page that accesses P2. If there is none, P2 is judged to belong to a new user session and no path completion is needed; if there is one, the user performed a back operation and reached P2 through P1, and the path needs to be completed.
Path completion is generally done by matching parent nodes. When it is judged that the path between two pages needs to be supplemented, the parent page of P2 is checked first and matched against the parent node of P1. If they are the same, the parent node of P1 can be used directly as the complete path between P1 and P2. If they differ, the grandparent node of P1 is checked next and matched against the parent node of P2, and so on, until all nodes needed to reach the matching parent node of P2 have been added to the user access path.
Further, URL processing includes the following steps:
3.1 Read URLs: read a batch of URLs from the URL queue of the Redis database onto the Storm distributed platform for processing;
3.2 URL deduplication: filter out URLs that have already been crawled, so that the web crawler does not crawl the same URL repeatedly, which improves the crawl efficiency of the crawler system;
3.3 Extract domain names: extract the domain name from each URL and, based on the characteristics of website domain names, identify the website URL queue to which the URL belongs;
3.4 URL storage: store URLs into different website URL queues according to domain name; URL storage is implemented with TridentState.
In this embodiment, a path supplementation step is added to data crawling. Path supplementation completes the full access path to a visited page, which improves the completeness of crawled data, avoids data gaps, and improves the accuracy of crawling. In addition, the data are cleaned before storage, which reduces abnormal or missing data and makes the collected data more accurate; the cleaned data also take up less storage space, which improves storage efficiency.
Embodiment 2
Considering that coal consumption in northern heavy-industry cities differs between the heating period and the non-heating period, so that the sulfur dioxide content in the air also differs, and considering the effect of coal burning in thermal power plants on the air, the sulfur dioxide treatment scheme specifically adopts membrane absorption: desulfurization is achieved with a graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane absorber, with sodium hydroxide solution as the absorbing liquid.
The graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane is prepared as follows:
4.1 Perfluoroethylene-propylene, a pore-forming agent composed of nano-scale silica and an interface treating agent, and the plasticizer dioctyl phthalate are mixed uniformly in a mass ratio of 3:2:1 and dried under vacuum at 98 °C for 10 h; graphene oxide is added and thoroughly mixed, the mass of graphene oxide being 0.2% of the total mass of the mixture of the three; after drying, melt drawing is carried out;
4.2 The drawn material is subjected to electrospinning at a voltage of 25 kV and an injection rate of 2.0 mL/h; the material obtained after electrospinning is extracted, washed and soaked with absolute ethanol to obtain the graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane.
The resulting graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane is tested for water contact angle, pure water flux, breaking strength, and acid and alkali resistance:
Water contact angle test: the dried membrane sample is fixed on a glass slide, and the static water contact angle is measured at room temperature;
Pure water flux test: the water flux of the membrane is measured at a constant pressure of 0.15 MPa;
Breaking strength: the elongation at break of the membrane is measured with a tensile testing machine;
Acid and alkali resistance test: the membrane is immersed separately in dilute sulfuric acid and in aqueous sodium hydroxide solution for 30 days, after which its performance retention is measured.
Table 1 shows the performance evaluation of the graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane in Embodiment 2:
Table 1:
As can be seen from Table 1, the graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane prepared in this experiment has good permeability, high elongation at break, and high resistance to acid and alkali corrosion, and has good application prospects.
Embodiment 3-5
The prepared graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane is used as the membrane absorber, with sodium hydroxide solution as the absorbing liquid, to carry out a simulated indoor sulfur dioxide desulfurization experiment. The experimental device and flow chart are shown in Fig. 2. Simulated indoor sulfur-containing gas is prepared from compressed air and SO2; its concentration is controlled at (1 ± 0.1) mg/m³ with a mass flow controller, and the SO2 is thoroughly mixed with air in a static mixer. During operation, the inlet pressure of the simulated indoor sulfur-containing gas does not exceed 1 kPa and the temperature is set to 25 °C. The absorbing liquid is 0.1 mol/L sodium hydroxide solution. A gas detection port is installed at the front end of the membrane absorber (before membrane absorption) and at the rear end (after membrane absorption) to measure how well the membrane absorber absorbs SO2.
Table 2 gives the relevant proportions of sulfur dioxide absorbed by the membrane absorber:
Table 2:
Fig. 3 is a bar chart of sulfur dioxide concentration before and after absorption. As can be seen from Table 2 and Fig. 3, the graphene oxide-modified perfluoroethylene-propylene hollow fiber membrane used as the membrane absorber achieves a high sulfur dioxide absorption rate, close to 100%; compared with traditional desulfurization methods it consumes less energy and is easy to operate, and it has good application prospects.
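The text does not give the formula behind the absorption rate. A common definition, assumed here, is the relative drop in concentration between the front-end and rear-end detection ports. The sketch below uses that definition; the function name is hypothetical and any outlet value passed to it would be an illustrative number, not a result from Table 2.

```python
def absorption_rate(inlet_mg_m3: float, outlet_mg_m3: float) -> float:
    """Percentage of SO2 removed between the inlet and outlet detection ports.

    Assumed definition: (C_in - C_out) / C_in * 100.
    """
    if inlet_mg_m3 <= 0:
        raise ValueError("inlet concentration must be positive")
    return (inlet_mg_m3 - outlet_mg_m3) / inlet_mg_m3 * 100.0
```

With the inlet held at about 1 mg/m³ as in the experiment, an absorption rate close to 100% corresponds to an outlet concentration close to zero.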
The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the protection scope of the present invention.

Claims (9)

1. A Web-based ecological space data crawling method, characterized by comprising the following steps:
(1) Sulfur dioxide data crawling: a crawling scheme is selected according to the characteristics of sulfur dioxide concentration data, the data are crawled, and cities where the sulfur dioxide concentration exceeds the standard are screened out;
(2) Indoor sulfur dioxide detection: a sulfur dioxide detector is used to measure the indoor sulfur dioxide concentration in the cities where the sulfur dioxide concentration exceeds the standard;
(3) Data comparison: the measured indoor sulfur dioxide concentration is compared with the crawled data;
(4) Sulfur dioxide desulfurization treatment: if the indoor sulfur dioxide concentration shows a gradually rising trend and approaches the crawled sulfur dioxide concentration, desulfurization treatment is started.
2. The Web-based ecological space data crawling method according to claim 1, wherein the specific steps of sulfur dioxide data crawling in step (1) are as follows:
1.1 Select a crawling scheme: sulfur dioxide concentration data are voluminous and updated quickly. Based on these characteristics, the HTML pages of environmental monitoring websites are analyzed to locate the URLs and tags of the required data, and, according to the tags and URL information, a distributed web crawler is chosen as the page information crawling scheme;
1.2 Crawl data: extract relevant URLs from web pages, add them to the URL queue, and crawl the data on the websites;
1.3 URL processing: read URLs, deduplicate URLs, extract domain names, and store URLs;
1.4 Clean data: the crawled data undergo data cleaning, which consists of consistency checking and the handling of invalid and missing values. Consistency checking examines the data according to each variable's reasonable value range and the relationships between variables, and finds values that are out of the normal range or mutually contradictory. Invalid and missing values are handled by estimation, direct rejection, global-variable filling, or random interpolation;
1.5 Store data: store the crawled data in a database.
3. The Web-based ecological space data crawling method according to claim 1, wherein the specific steps of crawling data are as follows:
2.1 The crawler engine opens the main domain of an environmental monitoring website, finds the parser that handles that website, and obtains from the parser the start URLs to be crawled first;
2.2 The crawler engine sends the start URLs to the scheduler, and the scheduler adds them to the shared to-be-crawled URL queue stored on the Redis cache server of the master node;
2.3 The crawler engine asks the scheduler for the remaining URLs in the shared to-be-crawled URL queue;
2.4 The scheduler looks up and returns the first URL in the shared to-be-crawled URL queue, and the crawler engine then sends the network request for that URL to the downloader through the downloader middleware;
2.5 The downloader downloads the web page corresponding to the URL, and then passes the downloaded data on sulfur dioxide content in the air to the crawler engine through the downloader middleware;
2.6 The crawler engine passes the downloaded data to the parser through the crawler middleware;
2.7 The parser analyzes and processes the downloaded data, extracts the data items of interest and new URLs, and sends them to the crawler engine;
2.8 Crawled web path supplementation: when a page is a referenced page, the recorded path may be incomplete and the collected data may have gaps; to avoid this, the path of the crawled web pages is supplemented.
4. The Web-based ecological space data crawling method according to claim 1, wherein the URL processing comprises the following steps:
3.1, Read URLs: a batch of URLs is read from the URL queue in the Redis database onto the Storm distributed platform for processing;
3.2, URL deduplication: URLs that have already been crawled are filtered out, which prevents the web crawler from crawling the same URL repeatedly and improves the crawl efficiency of the crawler system;
3.3, Extract the domain name: the domain name is extracted from each URL and, based on the characteristics of the website's domain name, the website URL queue to which the URL belongs is identified;
3.4, URL storage: each URL is stored in the website URL queue that matches its domain name; URL storage is implemented with TridentState.
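The deduplication and domain-routing logic of steps 3.2 to 3.4 can be sketched as below. The claim implements this on Storm with TridentState; the sketch instead uses an in-process `set` and per-domain `deque` objects purely to show the logic, and the function name `process_batch` and the example URLs are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def process_batch(batch, seen, site_queues):
    """Deduplicate a batch of URLs and route each one to its site queue."""
    for url in batch:
        if url in seen:
            continue                      # 3.2: skip URLs already handled
        seen.add(url)
        domain = urlsplit(url).hostname   # 3.3: extract the domain name
        if domain is None:
            continue                      # not an absolute URL; ignore it
        site_queues[domain].append(url)   # 3.4: store by domain name
    return site_queues


batch = [
    "http://a.example/x",
    "http://b.example/y",
    "http://a.example/x",   # duplicate, filtered out
    "http://a.example/z",
]
queues = process_batch(batch, set(), defaultdict(deque))
```

In a distributed deployment the `seen` set and the queues would have to live in shared storage so that all workers observe the same state; an in-process set only deduplicates within one worker.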
5. The Web-based ecological space data crawling method according to claim 1, wherein the data is crawled for sulfur dioxide concentrations in the main northern heavy-industry cities, such as those dominated by thermal power plants and the chemical industry.
6. The Web-based ecological space data crawling method according to claim 1, wherein the sulfur dioxide treatment scheme is membrane absorption: desulfurization is achieved by using a graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane absorber with sodium hydroxide solution as the absorbing liquid.
7. The Web-based ecological space data crawling method according to claim 6, wherein the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane is prepared as follows:
4.1, Three materials are mixed uniformly: perfluoroethylene-propylene, a pore-forming agent composed of nanoscale silica and an interface treating agent, and the plasticizer dioctyl phthalate. Graphene oxide is then added under vacuum and mixed in thoroughly, and after drying the mixture is melt-drawn into filaments;
4.2, The drawn material is subjected to electrostatic spinning, then extracted, washed, and soaked with absolute ethanol to obtain the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane.
8. The Web-based ecological space data crawling method according to claim 7, wherein the mass ratio of perfluoroethylene-propylene : pore-forming agent : plasticizer is 3:2:1, the mass of graphene oxide is 0.2% of the total mass of the mixture of the three materials, the vacuum drying temperature is 98 °C, and the drying time is 10 h.
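As a worked example of the proportions in claim 8, the following sketch computes component masses for a batch. The 600 g batch size and the function name `batch_masses` are assumptions made for illustration; the sketch also assumes that the 0.2% graphene oxide fraction is taken relative to the combined mass of the three mixed materials.

```python
def batch_masses(total_mix_g, ratio=(3, 2, 1), go_fraction=0.002):
    """Split a three-component mix by mass ratio and size the GO addition."""
    parts = sum(ratio)
    polymer, pore_former, plasticizer = (
        total_mix_g * r / parts for r in ratio
    )
    return {
        "perfluoroethylene_propylene_g": polymer,
        "pore_forming_agent_g": pore_former,
        "plasticizer_g": plasticizer,
        # Graphene oxide as a fraction of the three-component mix mass.
        "graphene_oxide_g": total_mix_g * go_fraction,
    }


masses = batch_masses(600.0)  # hypothetical 600 g batch
```

For the assumed 600 g batch this gives 300 g of polymer, 200 g of pore-forming agent, 100 g of plasticizer, and about 1.2 g of graphene oxide.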
9. The Web-based ecological space data crawling method according to claim 7, wherein the electrostatic spinning is carried out at a voltage of 25 kV and an injection rate of 2.0 mL/h.
CN201910498905.9A 2019-06-11 2019-06-11 Ecological space data crawling method based on Web Pending CN110347895A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910498905.9A CN110347895A (en) 2019-06-11 2019-06-11 Ecological space data crawling method based on Web

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910498905.9A CN110347895A (en) 2019-06-11 2019-06-11 Ecological space data crawling method based on Web

Publications (1)

Publication Number Publication Date
CN110347895A (en) 2019-10-18

Family

ID=68181754

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910498905.9A Pending CN110347895A (en) 2019-06-11 2019-06-11 Ecological space data crawling method based on Web

Country Status (1)

Country Link
CN (1) CN110347895A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN2472235Y (en) * 2001-03-21 2002-01-16 南京大学 Indoor air quality monitor
CN104329773A (en) * 2014-10-29 2015-02-04 无锡悟莘科技有限公司 Control method of ventilation air conditioner based on CO (Carbon Monoxide) detection
CN109351209A (en) * 2018-12-20 2019-02-19 天津工业大学 A kind of the film formula and preparation method of perfluoroethylene-propylene hollow-fibre membrane

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
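曾李阳 (Zeng Liyang): "基于分布式网络爬虫的Web空间数据获取与管理方法研究" [Research on Web spatial data acquisition and management methods based on a distributed web crawler], 《中国优秀硕士学位论文全文数据库 基础科学辑》 [China Master's Theses Full-text Database, Basic Sciences] *
李中良 (Li Zhongliang): "基于Web日志挖掘和关联规则的个性化推荐系统模型研究" [Research on a personalized recommendation system model based on Web log mining and association rules], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] *
荣晗 (Rong Han): "基于分布式的网络爬虫系统的研究与实现" [Research and implementation of a distributed web crawler system], 《中国优秀硕士学位论文全文数据库 信息科技辑》 [China Master's Theses Full-text Database, Information Science and Technology] *
赵猛 (Zhao Meng): "基于数据挖掘技术的大气环境预测研究" [Research on atmospheric environment prediction based on data mining technology], 《中国优秀硕士学位论文全文数据库 工程科技Ⅰ辑》 [China Master's Theses Full-text Database, Engineering Science and Technology I] *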

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183060A (en) * 2020-09-28 2021-01-05 重庆工商大学 Reference resolution method of multi-round dialogue system
CN112183060B (en) * 2020-09-28 2022-05-10 重庆工商大学 Reference resolution method of multi-round dialogue system
CN112417073A (en) * 2020-11-18 2021-02-26 中科三清科技有限公司 Automatic air quality condition broadcasting method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110347895A (en) Ecological space data crawling method based on Web
Hollander et al. Inhibition and enhancement in the analysis of airborne endotoxin levels in various occupational environments
CN105131185A (en) Pineapple waste hemicellulose based pH sensitive type porous hydrogel as well as preparation method and application thereof
CN104525136B (en) A kind of composite and its production and use
CN104014313A (en) Improved wheat husk adsorbent
CN110257126A (en) A kind of greasy filth modifying agent and its preparation method and application
CN107163964A (en) A kind of multi-production process of pomelo peel regeneration product, shaddock peel adsorbent, and, shaddock peel essential oil
CN102423693B (en) Preparation and application methods of sulfur dioxide rice straw adsorbent
CN106914066A (en) A kind of grease proofing acupuncture environmental protection coated filter material terylene filter felt of PPS water repellents
CN109535456B (en) Application of cellulose nanocrystalline rainbow film in rapid and accurate detection of formaldehyde
CN106362712A (en) Rice husk base ion-exchange adsorption material, preparation method thereof and application
CN102735735B (en) Functional bismuth oxyiodide nanoflake array photoelectric organophosphorus pesticide biosensor and preparation method thereof
CN108069424A (en) A kind of method that agricultural crop straw prepares low ash content active carbon with high specific surface area
CN107118304A (en) A kind of preparation method of kapok fiber oil absorption material
CN108896737B (en) Farmland heavy metal pollution on-line monitoring early warning and real-time processing system
CN106268681A (en) A kind of method utilizing shrimp and crab shells to prepare environmentally friendly dye sorbent
CN106317258A (en) Production technology of low molecular weight heparin sodium without residual organic solvent
CN103698253B (en) A kind of method of phytoplankton absorption coefficients in separating granular
CN103901140B (en) A kind of pre-treating method analyzed for tetrabromobisphenol A in ight soil after biology contamination
CN104014310A (en) Method for synthesizing multifunctional composite water treatment agent
CN109569535A (en) The adsorbent material and preparation method of radioiodine in a kind of enriching seawater
CN105709700B (en) A kind of dimethyl diallyl ammonium chloride is modified the preparation of reed rod adsorbent
CN105921131A (en) Preparation method for silver extraction material
CN105854826B (en) A kind of preparation of vinylbenzyltrimethyl ammonium chloride reed rod adsorbent
CN113198428A (en) Method for preparing three-dimensional multifunctional adsorbing material in situ by using corn pith and application thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191018
