CN110347895A - Ecological space data crawling method based on Web

Info

Publication number: CN110347895A
Application number: CN201910498905.9A
Authority: CN
Inventors: 白云; 李川; 刘岱
Original assignee: Industrial And Commercial University Of Chongqing School Of Wisdom
Current assignee: Industrial And Commercial University Of Chongqing School Of Wisdom
Priority date: 2019-06-11
Filing date: 2019-06-11
Publication date: 2019-10-18

Abstract

The invention discloses a Web-based ecological space data crawling method, belonging to the Internet field, characterized in that the method includes the following steps: (1) sulfur dioxide data crawling: select a crawling scheme according to the characteristics of sulfur dioxide concentration data, crawl the data, and screen out the cities where the sulfur dioxide concentration exceeds the standard; (2) indoor sulfur dioxide detection: use a sulfur dioxide detector to measure the indoor sulfur dioxide concentration in those cities; (3) data comparison: compare the detected indoor sulfur dioxide concentration with the crawled data; (4) sulfur dioxide desulfurization treatment: if the indoor sulfur dioxide concentration shows a gradually rising trend and approaches the crawled sulfur dioxide concentration, start the desulfurization treatment. With the invention, data crawling is used to screen out cities with excessive sulfur dioxide concentration, indoor sulfur dioxide detection is carried out in a targeted way, and a corresponding indoor sulfur dioxide treatment scheme is provided.

Description

Ecological space data crawling method based on Web

Technical field

The invention belongs to the Internet field, and in particular relates to a Web-based ecological space data crawling method.

Background technique

Web spatial data collection mainly relies on web crawler technology. A web crawler, also known as a web spider or web robot, is a program or script that automatically fetches web information according to certain rules. A traditional web crawler starts from the URLs of one or several initial pages and obtains the URLs on those pages; while fetching pages, it continuously extracts new URLs from the current page and puts them into a queue, until a stop condition of the system is met.

At present, air quality monitoring data are updated quickly and are large in volume, while traditional research on crawler-based web spatial data collection mostly uses single-machine web crawlers. However, web spatial data are widely distributed across different websites and are updated frequently, so a single-machine crawler struggles to meet demand in both crawl coverage and crawl efficiency, and it is difficult to guarantee the timeliness and completeness of the crawled data. In addition, to improve crawl efficiency a single-machine crawler is usually implemented with asynchronous multithreading, which is difficult to implement, hard to maintain, and prone to deadlock.

Among the air quality data collected by web spatial data crawling, sulfur dioxide is one of the main atmospheric pollutants and one of the causes of acid rain. If the sulfur dioxide concentration in the air exceeds the standard, it is very harmful to the human body: irritation of the eyes and nasal mucosa is likely, and laryngeal and bronchial spasm may occur, leading to coma in mild cases and death in severe cases. If the sulfur dioxide concentration in the environment exceeds the standard, plants also show "poisoning" symptoms: the leaves gradually fade and wilt and the veins turn white, eventually causing death. After sulfur dioxide in the air dissolves in water, it acidifies soil and water bodies and does great harm to humans and plants; sulfur dioxide and the resulting acid rain also seriously hinder socio-economic development.

Sulfur dioxide pollution in China is serious. The main cause is that China's industrial structure is dominated by coal-based industry, and coal burning emits large amounts of sulfur dioxide and other pollutants; at the same time, sulfur dioxide pollution prevention and control work is inadequate, and protection of the ecological environment is not taken seriously.

Current sulfur dioxide treatment methods are physical or chemical. Among the physical methods, adsorption and solvent absorption are widely used; the chemical methods mainly include the calcium method, the sodium method, and the potassium method. Desulfurization by a physical method alone generally requires adding a large amount of chemical substances to treat the sulfur dioxide, and the energy consumption is high. In chemical methods, because of the reaction conditions, the treated solution often contains unreacted solution, which makes separation difficult, hinders reuse, and wastes resources.

Summary of the invention

The present invention provides a Web-based ecological space data crawling method, to solve at least the problems in the related art that air quality data collection yields an excessive data volume, too much useless data, and low data quality. The invention also analyzes the sulfur dioxide in the air quality data, identifies the factors that influence the sulfur dioxide content, and provides a corresponding indoor sulfur dioxide treatment scheme.

The Web-based ecological space data crawling method comprises the following steps:

(1) Sulfur dioxide data crawling: select a crawling scheme according to the characteristics of sulfur dioxide concentration data, crawl the data, and screen out the cities where the sulfur dioxide concentration exceeds the standard;

(2) Indoor sulfur dioxide detection: use a sulfur dioxide detector to measure the indoor sulfur dioxide concentration in the cities where the sulfur dioxide concentration exceeds the standard;

(3) Data comparison: compare the detected indoor sulfur dioxide concentration with the crawled data;

(4) Sulfur dioxide desulfurization treatment: if the indoor sulfur dioxide concentration shows a gradually rising trend and approaches the crawled sulfur dioxide concentration, start the sulfur dioxide desulfurization treatment.
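The screening in step (1) and the trigger condition in step (4) can be illustrated with a minimal Python sketch. The function names, the concentration limit, and the `closeness` ratio used to decide that the indoor value is "close to" the crawled value are all assumptions made for illustration; the method itself does not specify them.

```python
from typing import Mapping, Sequence


def cities_over_limit(crawled: Mapping[str, float], limit: float) -> list[str]:
    """Step (1): keep the cities whose crawled SO2 concentration exceeds `limit`."""
    return sorted(city for city, value in crawled.items() if value > limit)


def should_start_desulfurization(
    indoor_readings: Sequence[float],
    crawled_concentration: float,
    closeness: float = 0.8,  # assumed ratio; not given in the source
) -> bool:
    """Step (4): True when indoor SO2 is rising and close to the crawled value."""
    if len(indoor_readings) < 2 or crawled_concentration <= 0:
        return False
    # "Gradually rising": no reading drops, and the last is above the first.
    pairs = zip(indoor_readings, indoor_readings[1:])
    rising = all(b >= a for a, b in pairs) and indoor_readings[-1] > indoor_readings[0]
    # "Close to": the latest indoor reading reaches the assumed fraction
    # of the crawled concentration.
    close = indoor_readings[-1] >= closeness * crawled_concentration
    return rising and close
```

In practice the trend test would likely be made more tolerant of sensor noise, for example by fitting a slope over a time window instead of requiring a monotonic sequence.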

Further, the specific steps of sulfur dioxide data crawling in step (1) are as follows:

1.1. Select a crawling scheme: sulfur dioxide concentration data are large in volume and updated quickly. Analyze the HTML pages of the environmental monitoring website, find the URLs and tags of the required data, and, based on the tag and URL information, choose a distributed web crawler as the page crawling scheme;

1.2. Crawl data: extract the relevant URLs from the web pages, add them to the URL queue, and crawl the data on the website;

1.3. URL processing: read URLs, deduplicate URLs, extract domain names, and store URLs;

1.4. Clean data: clean the crawled data, which includes consistency checking and handling of invalid and missing values. Consistency checking uses the reasonable value range of each variable and the relationships between variables to examine the data and find values that are out of the normal range or contradict each other. Invalid and missing values are handled by estimation, direct deletion, global variable filling, or random interpolation;

1.5. Store data: store the crawled data in the database.
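The cleaning in step 1.4 can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the record layout (`city`, `so2`), the valid range, and the two strategies shown (direct deletion, and filling with the mean of the valid values as one form of estimation) are hypothetical choices, not details given in the source.

```python
from statistics import mean
from typing import Optional

# Assumed plausible SO2 range in µg/m³; the source does not give one.
VALID_RANGE = (0.0, 2000.0)


def clean(records: list[dict], strategy: str = "drop") -> list[dict]:
    """Range-check each record, then drop or fill invalid/missing values."""
    low, high = VALID_RANGE
    checked = []
    for rec in records:
        value: Optional[float] = rec.get("so2")
        # Consistency check: an out-of-range value is treated as invalid.
        if value is not None and not (low <= value <= high):
            value = None
        checked.append({**rec, "so2": value})

    if strategy == "drop":  # direct deletion
        return [r for r in checked if r["so2"] is not None]
    if strategy == "mean":  # estimation from the valid values
        valid = [r["so2"] for r in checked if r["so2"] is not None]
        if not valid:
            return []
        fill = mean(valid)
        return [{**r, "so2": fill if r["so2"] is None else r["so2"]} for r in checked]
    raise ValueError(f"unknown strategy: {strategy}")
```

Cross-variable consistency checks (for example, comparing a reading against other pollutants measured at the same station) would follow the same pattern with an extra predicate per rule.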

Further, the specific steps of crawling data are as follows:

2.1. The crawler engine opens the main domain of an environmental monitoring website, finds the parser that handles this website, and obtains from the parser the starting URLs that the system must crawl first;

2.2. The crawler engine sends the obtained starting URLs to the scheduler, and the scheduler adds them to the shared to-be-crawled URL queue stored on the Redis cache server of the master node;

2.3. The crawler engine asks the scheduler for the remaining URLs in the shared to-be-crawled URL queue;

2.4. The scheduler looks up and returns the first URL to be crawled in the shared to-be-crawled URL queue, and the crawler engine then sends the network request for that URL to the downloader through the downloader middleware;

2.5. The downloader downloads the web page corresponding to the URL, and then passes the downloaded data on sulfur dioxide content in the air to the crawler engine through the downloader middleware;

2.6. The crawler engine passes the downloaded data to the parser through the crawler middleware;

2.7. The parser analyzes and processes the downloaded data, extracts the data items of interest and new URLs, and sends them to the crawler engine;

2.8. Supplement the crawled web page path: when a web page is a referenced page, the path may be incomplete and the collected data may have gaps, so the crawled web page path is supplemented.

Whether a path needs to be supplemented is judged as follows: given two consecutive pages P1 and P2, if P1 is the page that refers to P2, the path between the two pages can be completed directly; if P1 is not the referring page, check whether the page from which P2 was accessed exists in the user access path. If it does not, P2 is judged to start a new user session and no path completion is needed; if it does, the user performed a Back operation and reached P2 from P1, so the path needs to be completed.

Path completion is generally done by matching parent nodes. When a path needs to be supplemented between two pages, first take the parent page of P2 and match it against the parent node of P1; if they are the same, the parent node of P1 directly serves as the complete path between P1 and P2. If they differ, continue to the grandparent node of P1 and match it against the parent node of P2, and so on, until all parent nodes of P2 that need matching have been completed into the user access path.
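The judgment and the parent-matching procedure above can be sketched as follows. The sketch assumes that the referring page of P2 is known (for example from access logs) and that the site hierarchy is available as a `parent` mapping from each page to its parent page; both, along with the function names, are hypothetical. It also assumes that a direct referral from P1 to P2 needs no extra pages inserted.

```python
from typing import Optional


def needs_completion(p1: str, p2_referrer: Optional[str], history: list[str]) -> bool:
    """Decide whether pages must be inserted between consecutive pages P1 and P2."""
    if p2_referrer == p1:
        return False  # direct link from P1 to P2: nothing to insert
    # Referrer seen earlier in the session: the user went Back before reaching P2.
    # Referrer never seen: P2 starts a new session, so nothing is inserted.
    return p2_referrer is not None and p2_referrer in history


def completion_path(p1: str, p2: str, parent: dict[str, str]) -> list[str]:
    """Walk up from P1 (parent, grandparent, ...) until P2's parent is matched."""
    target = parent.get(p2)
    if target is None:
        return []
    path: list[str] = []
    visited = {p1}
    node = parent.get(p1)
    while node is not None and node not in visited:
        path.append(node)
        if node == target:
            return path  # pages to insert between P1 and P2
        visited.add(node)
        node = parent.get(node)
    return []  # P2's parent is not an ancestor of P1
```

For a hierarchy where `C` is under `B`, and `B` and `D` are both under `A`, moving from `C` to `D` yields the inserted pages `B` and then `A`, which corresponds to two Back operations.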

Further, URL processing includes the following steps:

3.1. Read URLs: read a batch of URLs from the URL queue of the Redis database onto the Storm distributed platform for processing;

3.2. Deduplicate URLs: filter out the URLs that have already been crawled, so that the web crawler does not crawl the same URL repeatedly and the crawl efficiency of the crawler system improves;

3.3. Extract domain names: extract the domain name from each URL and, based on the characteristics of the website domain name, identify the website URL queue to which the URL belongs;

3.4. Store URLs: store each URL in the URL queue of its website according to the domain name; URL storage is implemented with TridentState.
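Steps 3.2 to 3.4 can be illustrated with a small single-process sketch. It deliberately replaces the Redis queue, the Storm platform, and TridentState with in-memory Python structures, so it shows only the deduplicate, extract-domain, and route logic; the function name `route_urls` and the example URLs are hypothetical.

```python
from collections import defaultdict, deque
from urllib.parse import urlsplit


def route_urls(batch: list[str], seen: set[str], queues: dict[str, deque]) -> None:
    """Deduplicate a batch of URLs and append each new one to its domain's queue."""
    for url in batch:
        if url in seen:  # step 3.2: skip URLs that were already handled
            continue
        seen.add(url)
        domain = urlsplit(url).hostname  # step 3.3: extract the domain name
        if domain is None:  # not an absolute URL; nothing to route
            continue
        queues[domain].append(url)  # step 3.4: store in the per-site queue


# Usage: `seen` and `queues` persist across batches.
seen: set[str] = set()
queues: dict[str, deque] = defaultdict(deque)
route_urls(
    ["http://a.example/x", "http://a.example/x", "http://b.example/y"],
    seen,
    queues,
)
```

In a distributed deployment the `seen` set would have to be shared between workers, for example as a set in the same Redis instance that holds the queues, which is an assumption rather than something the source states.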

Further, the crawled data are the sulfur dioxide concentrations of major northern heavy-industry cities, such as those dominated by thermal power plants and the chemical industry.

Further, sulfur dioxide is treated by membrane absorption: a membrane absorber made of graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane is used, with sodium hydroxide solution as the absorbing liquid, to achieve desulfurization.

Further, the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane is prepared as follows:

4.1. Mix three materials uniformly: perfluoroethylene-propylene, a pore-forming agent composed of nanoscale silica and an interface treatment agent, and the plasticizer dioctyl phthalate. Dry the mixture under vacuum, add graphene oxide and mix thoroughly, and carry out melt drawing after drying;

4.2. Subject the drawn material to electrospinning, then extract, wash, and soak it with anhydrous ethanol to obtain the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane.

Further, the mass ratio of perfluoroethylene-propylene to pore-forming agent to plasticizer is 3:2:1, the mass of graphene oxide is 0.2% of the total mass of the three-component mixture, the vacuum drying temperature is 98 °C, and the drying time is 10 h.

Further, the electrospinning is carried out at a voltage of 25 kV and an injection speed of 2.0 mL/h.
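As a worked example of the proportions just stated, the short Python sketch below splits a batch into the 3:2:1 components and computes the graphene oxide addition as 0.2% of the three-component total. The function name and the 600 g batch size are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def batch_masses(total_mixture_mass_g: float) -> dict[str, float]:
    """Split a three-component batch 3:2:1 and add graphene oxide at 0.2%."""
    ratio = {"perfluoroethylene_propylene": 3, "pore_forming_agent": 2, "plasticizer": 1}
    parts = sum(ratio.values())
    masses = {name: total_mixture_mass_g * share / parts for name, share in ratio.items()}
    # Graphene oxide is dosed relative to the total mass of the three-component mixture.
    masses["graphene_oxide"] = total_mixture_mass_g * 0.002
    return masses


# Hypothetical 600 g batch: about 300 g polymer, 200 g pore-forming agent,
# 100 g plasticizer, and 1.2 g graphene oxide.
print(batch_masses(600.0))
```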

Beneficial effect

(1) The present invention uses distributed web crawling. The master node does not crawl data and only schedules crawl tasks, while the crawler nodes download and extract data, so deployment is simple and the system is easy to extend. Resuming a crawl from a breakpoint is supported: after a fault is repaired the crawl can run again and quickly restore the previous data structures, which improves system stability. The master node balances the load across the crawler nodes of the distributed crawler system, so that no crawler node is overloaded or left idle and all crawler nodes carry roughly the same workload, which improves crawl efficiency. Periodic updates are also possible, which improves resource utilization.

(2) The present invention adds a path supplement step to data crawling. Path supplement completes the full access path of the visited pages, which improves the completeness of the crawled data, avoids missing data, and improves crawl accuracy.

(3) The present invention cleans the data before storing them, which reduces abnormal or missing data and makes the collected data more accurate; the processed data also take up less storage space, which improves storage efficiency.

(4) The present invention uses membrane absorption. The gas-phase sulfur dioxide and the sodium hydroxide solution are separated by a porous membrane, and the sulfur dioxide gas reaches the gas-liquid interface through the pores of the membrane. After the sulfur dioxide reacts with the sodium hydroxide, desulfurization is achieved and sulfur resources can also be recovered.

(5) The present invention uses a graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane. The modification increases the hydrophilicity, electrical conductivity, and tensile strength of the membrane material, so that the modified hollow fiber membrane performs sulfur dioxide desulfurization better.

Detailed description of the invention

Fig. 1 is a flow diagram of spatial data crawling;

Fig. 2 is a schematic diagram and flow chart of the membrane absorber device for sulfur dioxide absorption;

Fig. 3 is a histogram of the sulfur dioxide concentration before and after absorption.

Specific embodiment

The technical solutions of the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.

Embodiment 1

Fig. 1 is the flow diagram of spatial data crawling provided in this embodiment. As shown in Fig. 1, the Web-based ecological space data crawling method comprises the following steps:

1.5. Store data: store the crawled data in the database.

Further, the specific steps of crawling data are as follows:

Further, URL processing includes the following steps:

In this embodiment, a path supplement step is added to data crawling. Path supplement completes the full access path of the visited pages, which improves the completeness of the crawled data, avoids missing data, and improves crawl accuracy. The data are also cleaned before storage, which reduces abnormal or missing data and makes the collected data more accurate; the processed data also take up less storage space, which improves storage efficiency.
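The crawl cycle used in this embodiment (engine, scheduler queue, downloader, parser) can be sketched as a single-process loop. This is a simplified illustration, not the distributed implementation: an in-memory `deque` stands in for the shared URL queue on the Redis server, and the `download` and `parse` callables are hypothetical stand-ins for the downloader and the site-specific parser.

```python
from collections import deque
from typing import Any, Callable, Iterable


def crawl(
    start_urls: Iterable[str],
    download: Callable[[str], Any],
    parse: Callable[[str, Any], tuple[list, list[str]]],
    max_pages: int = 100,
) -> list:
    """Fetch pages from a URL queue until it is empty or max_pages is reached."""
    start = list(start_urls)
    queue = deque(start)  # stands in for the shared to-be-crawled URL queue
    seen = set(start)
    items: list = []
    fetched = 0
    while queue and fetched < max_pages:  # stop condition of the crawl
        url = queue.popleft()  # scheduler hands out the first queued URL
        page = download(url)  # downloader fetches the page
        fetched += 1
        new_items, new_urls = parse(url, page)  # parser extracts items and URLs
        items.extend(new_items)
        for new_url in new_urls:  # new URLs go back to the scheduler
            if new_url not in seen:
                seen.add(new_url)
                queue.append(new_url)
    return items
```

Separating `download` and `parse` from the loop mirrors the division of work described in the method: several worker processes could run the same loop against one shared queue while a master process only manages that queue.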

Embodiment 2

In northern heavy-industry cities, coal consumption differs between the heating period and the non-heating period, so the sulfur dioxide content in the air also differs, and coal burning in thermal power plants affects the air. The sulfur dioxide treatment scheme therefore specifically adopts membrane absorption: a membrane absorber made of graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane is used, with sodium hydroxide solution as the absorbing liquid, to achieve desulfurization.

The graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane is prepared as follows:

4.1. Mix three materials uniformly at a mass ratio of 3:2:1: perfluoroethylene-propylene, a pore-forming agent composed of nanoscale silica and an interface treatment agent, and the plasticizer dioctyl phthalate. Dry the mixture under vacuum at 98 °C for 10 h, then add graphene oxide and mix thoroughly, where the mass of graphene oxide is 0.2% of the total mass of the three-component mixture. Carry out melt drawing after drying;

4.2. Subject the drawn material to electrospinning at a voltage of 25 kV and an injection speed of 2.0 mL/h. After electrospinning, extract, wash, and soak the material with anhydrous ethanol to obtain the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane.

The resulting graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane is tested for water contact angle, pure water flux, breaking strength, and acid and alkali resistance:

Water contact angle test: the dried sample membrane is fixed on a glass slide and the static water contact angle is measured at room temperature;

Pure water flux test: the water flux of the membrane is measured at a constant pressure of 0.15 MPa;

Breaking strength: the elongation at break of the membrane is measured with a tensile testing machine;

Acid and alkali resistance test: the membrane is immersed separately in dilute sulfuric acid and in aqueous sodium hydroxide solution for 30 days, and its performance retention rate is then measured.

Table 1 shows the performance evaluation of the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane in Embodiment 2:

Table 1:

As can be seen from Table 1, the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane prepared in this experiment has good permeability, high elongation at break, and high resistance to acid and alkali corrosion, and therefore has good application prospects.

Embodiment 3-5

The graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane obtained above is used as the membrane absorber, with sodium hydroxide solution as the absorbing liquid, to carry out a simulated indoor sulfur dioxide desulfurization experiment. The experimental device and the experimental flow are shown in Fig. 2. The simulated indoor sulfur-containing gas is prepared from compressed air and SO₂; its concentration is controlled at (1 ± 0.1) mg/m³ with a mass flow controller, and the SO₂ is thoroughly mixed with air in a static mixer. During operation, the inlet pressure of the simulated indoor sulfur-containing gas does not exceed 1 kPa, the temperature is set to 25 °C, and the absorbing liquid is 0.1 mol/L sodium hydroxide solution. One gas detection port is placed at the front end of the membrane absorber (before membrane absorption) and one at the rear end (after membrane absorption) to measure how well the membrane absorber absorbs SO₂.

Table 2 shows the sulfur dioxide absorption ratios of the membrane absorber:

Table 2:

Fig. 3 is a histogram of the sulfur dioxide concentration before and after absorption. As can be seen from Table 2 and Fig. 3, the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane, used as the membrane absorber, absorbs sulfur dioxide at a high rate, close to 100%. Compared with traditional desulfurization methods, it consumes less energy and is easy to operate, and therefore has good application prospects.
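The absorption rate discussed here is presumably computed from the concentrations measured at the two gas detection ports, as (inlet − outlet) / inlet. The sketch below assumes that definition; the function name is hypothetical, and the outlet value in the usage line is an invented illustration, not a measurement from the experiments.

```python
def removal_efficiency(inlet_mg_m3: float, outlet_mg_m3: float) -> float:
    """Return the SO2 removal efficiency in percent from inlet/outlet concentrations."""
    if inlet_mg_m3 <= 0:
        raise ValueError("inlet concentration must be positive")
    if outlet_mg_m3 < 0:
        raise ValueError("outlet concentration cannot be negative")
    return (inlet_mg_m3 - outlet_mg_m3) / inlet_mg_m3 * 100.0


# Inlet of 1.0 mg/m³ matches the nominal experimental setting;
# the outlet value of 0.02 mg/m³ is purely illustrative.
print(f"{removal_efficiency(1.0, 0.02):.1f} %")
```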

The above is only a preferred embodiment of the present invention, but the protection scope of the present invention is not limited to it. Any equivalent substitution or change made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solution and inventive concept, shall fall within the protection scope of the present invention.

Claims

1. A Web-based ecological space data crawling method, comprising the following steps:

(1) Sulfur dioxide data crawling: select a crawling scheme according to the characteristics of sulfur dioxide concentration data, crawl the data, and screen out the cities where the sulfur dioxide concentration exceeds the standard;

(2) Indoor sulfur dioxide detection: use a sulfur dioxide detector to measure the indoor sulfur dioxide concentration in the cities where the sulfur dioxide concentration exceeds the standard;

(4) Sulfur dioxide desulfurization treatment: if the indoor sulfur dioxide concentration shows a gradually rising trend and approaches the crawled sulfur dioxide concentration, start the sulfur dioxide desulfurization treatment.

2. The Web-based ecological space data crawling method according to claim 1, wherein the specific steps of sulfur dioxide data crawling in step (1) are as follows:

1.1. Select a crawling scheme: sulfur dioxide concentration data are large in volume and updated quickly; analyze the HTML pages of the environmental monitoring website, find the URLs and tags of the required data, and, based on the tag and URL information, choose a distributed web crawler as the page crawling scheme;

1.4. Clean data: clean the crawled data, which includes consistency checking and handling of invalid and missing values; consistency checking uses the reasonable value range of each variable and the relationships between variables to examine the data and find values that are out of the normal range or contradict each other; invalid and missing values are handled by estimation, direct deletion, global variable filling, or random interpolation;

1.5. Store data: store the crawled data in the database.

3. The Web-based ecological space data crawling method according to claim 1, wherein the specific steps of crawling data are as follows:

2.1. The crawler engine opens the main domain of an environmental monitoring website, finds the parser that handles this website, and obtains from the parser the starting URLs that the system must crawl first;

2.2. The crawler engine sends the obtained starting URLs to the scheduler, and the scheduler adds them to the shared to-be-crawled URL queue stored on the Redis cache server of the master node;

2.4. The scheduler looks up and returns the first URL to be crawled in the shared to-be-crawled URL queue, and the crawler engine then sends the network request for that URL to the downloader through the downloader middleware;

2.5. The downloader downloads the web page corresponding to the URL, and then passes the downloaded data on sulfur dioxide content in the air to the crawler engine through the downloader middleware;

2.7. The parser analyzes and processes the downloaded data, extracts the data items of interest and new URLs, and sends them to the crawler engine;

2.8. Supplement the crawled web page path: when a web page is a referenced page, the path may be incomplete and the collected data may have gaps, so the crawled web page path is supplemented.

4. The Web-based ecological space data crawling method according to claim 1, wherein the URL processing includes the following steps:

3.1. Read URLs: read a batch of URLs from the URL queue of the Redis database onto the Storm distributed platform for processing;

3.2. Deduplicate URLs: filter out the URLs that have already been crawled, so that the web crawler does not crawl the same URL repeatedly and the crawl efficiency of the crawler system improves;

3.3. Extract domain names: extract the domain name from each URL and, based on the characteristics of the website domain name, identify the website URL queue to which the URL belongs;

5. The Web-based ecological space data crawling method according to claim 1, wherein the crawled data are the sulfur dioxide concentrations of major northern heavy-industry cities, such as those dominated by thermal power plants and the chemical industry.

6. The Web-based ecological space data crawling method according to claim 1, wherein sulfur dioxide is treated by membrane absorption: a membrane absorber made of graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane is used, with sodium hydroxide solution as the absorbing liquid, to achieve desulfurization.

7. The Web-based ecological space data crawling method according to claim 6, wherein the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane is prepared as follows:

4.1. Mix three materials uniformly: perfluoroethylene-propylene, a pore-forming agent composed of nanoscale silica and an interface treatment agent, and the plasticizer dioctyl phthalate; dry the mixture under vacuum, add graphene oxide and mix thoroughly, and carry out melt drawing after drying;

4.2. Subject the drawn material to electrospinning, then extract, wash, and soak it with anhydrous ethanol to obtain the graphene-oxide-modified perfluoroethylene-propylene hollow fiber membrane.

8. The Web-based ecological space data crawling method according to claim 7, wherein the mass ratio of perfluoroethylene-propylene to pore-forming agent to plasticizer is 3:2:1, the mass of graphene oxide is 0.2% of the total mass of the three-component mixture, the vacuum drying temperature is 98 °C, and the drying time is 10 h.

9. The Web-based ecological space data crawling method according to claim 7, wherein the electrospinning is carried out at a voltage of 25 kV and an injection speed of 2.0 mL/h.