A big data acquisition method and system for construction enterprises
Technical field
The present invention relates to the field of data acquisition technology for the construction industry, and in particular to a big data acquisition method and system for construction enterprises.
Background art
With the rapid development of the web, the World Wide Web has become the carrier of a vast amount of information, and extracting and using that information efficiently has become a major challenge. Traditional search engines such as Yahoo! and Google serve as tools that help people retrieve information and have become the entry point and guide for users accessing the web. However, these general-purpose search engines have certain limitations: they cannot perform targeted crawling of web page resources.
To solve this problem, web crawlers emerged. A web crawler can crawl relevant web page resources in a targeted manner. It is a program that downloads web pages automatically: according to a predefined crawl target, it selectively visits web pages and related links on the World Wide Web to obtain the information the user needs.
In the prior art, most web crawlers obtain information by crawling web pages, and most use a depth-first search strategy. That is, in a HyperText Markup Language (HTML) file, one hyperlink tag is selected and searched in depth. When the traversal reaches the bottom level of that hyperlink, a logical check determines that the search at this level has ended, the loop for this level exits, and control returns to the upper-level loop to search the other hyperlink tags, until all hyperlinks in the initial file have been traversed. An Internet Protocol (IP) proxy can be set during crawling to avoid being blocked by anti-crawler measures.
Because prior-art web crawlers mostly use a depth-first search strategy, problems arise as data structures grow more complex. The vertical depth of a website can increase without bound and cross-references can occur between levels, so the crawler may fall into an infinite loop that can only be exited by forcibly terminating the program. The information obtained contains a large amount of repetition and redundancy, so its quality is difficult to guarantee. Moreover, if the program is interrupted during execution by a network failure or another cause, the interruption point is hard to locate and the point at which the crawler should restart cannot be determined. Crawling of web page resources in the prior art is therefore inefficient.
In addition, prior-art web crawlers are not designed specifically for crawling web pages of the construction industry and do not take the characteristics of that industry into account. A web page resource crawling scheme suited to the construction industry is therefore urgently needed.
Summary of the invention
An object of the present invention is to provide a big data acquisition method and system for construction enterprises, which provide a big data collection mechanism suited to the construction industry and collect the web page resources of construction enterprises.
To achieve the above object, the present invention adopts the following technical solutions:
In one aspect, the present invention provides a big data acquisition method for construction enterprises, comprising:
Step 1: obtaining, in turn, the uniform resource locator (URL) address and the corresponding enterprise name of each of a plurality of construction enterprises according to the hierarchical regional location relationship of the construction enterprises;
Step 2: saving the URL addresses of the plurality of construction enterprises to a URL text file, and saving the enterprise names of the plurality of construction enterprises to a name text file;
Step 3: reading the URL address of a first construction enterprise from the URL text file, and reading the enterprise name of the first construction enterprise from the name text file, wherein the first construction enterprise is any one of the plurality of construction enterprises;
after step 3 is completed, performing step 4 and step 9 separately;
Step 4: obtaining, according to the URL address of the first construction enterprise, the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the operation information data item of the first construction enterprise; after step 4 is completed, performing the following step 5, step 6, step 7, and step 8 separately;
Step 5: using a first crawler program to obtain, according to the sub-URL address of the company profile data item, the content pages corresponding to the company profile data item, parsing those content pages to obtain the company profile information of the first construction enterprise, and storing the company profile information in a construction enterprise information database; and
Step 6: using a second crawler program to obtain, according to the sub-URL address of the honor information data item, the content pages corresponding to the honor information data item, parsing those content pages to obtain the honor information of the first construction enterprise, and storing the honor information in the construction enterprise information database; and
Step 7: using a third crawler program to obtain, according to the sub-URL address of the bid-winning information data item, the content pages corresponding to the bid-winning information data item, parsing those content pages to obtain the bid-winning information of the first construction enterprise, and storing the bid-winning information in the construction enterprise information database; and
Step 8: using a fourth crawler program to obtain, according to the sub-URL address of the operation information data item, the content pages corresponding to the operation information data item, parsing those content pages to obtain the operation information of the first construction enterprise, and storing the operation information in the construction enterprise information database;
Step 9: obtaining, according to the enterprise name of the first construction enterprise, the sub-URL address of the qualification grade data item and the sub-URL address of the structure art information data item of the first construction enterprise from the construction market supervision and integrity publishing platform;
after step 9 is completed, performing the following step 10 and step 11 separately;
Step 10: using a fifth crawler program to obtain, according to the sub-URL address of the qualification grade data item, the content pages corresponding to the qualification grade data item, parsing those content pages to obtain the qualification grade information of the first construction enterprise, and storing the qualification grade information in the construction enterprise information database; and
Step 11: using a sixth crawler program to obtain, according to the sub-URL address of the structure art information data item, the content pages corresponding to the structure art information data item, parsing those content pages to obtain the structure art information of the first construction enterprise, and storing the structure art information in the construction enterprise information database.
In another aspect, the present invention provides a big data acquisition system for construction enterprises, comprising:
an enterprise information acquisition module, configured to obtain, in turn, the uniform resource locator (URL) address and the corresponding enterprise name of each of a plurality of construction enterprises according to the hierarchical regional location relationship of the construction enterprises;
a file storage module, configured to save the URL addresses of the plurality of construction enterprises to a URL text file and to save the enterprise names of the plurality of construction enterprises to a name text file;
a file reading module, configured to read the URL address of a first construction enterprise from the URL text file and to read the enterprise name of the first construction enterprise from the name text file, wherein the first construction enterprise is any one of the plurality of construction enterprises;
after the file reading module finishes executing, a first sub-URL acquisition module and a second sub-URL acquisition module are executed separately, wherein
the first sub-URL acquisition module is configured to obtain, according to the URL address of the first construction enterprise, the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the operation information data item of the first construction enterprise; after the first sub-URL acquisition module finishes executing, the following first crawler processing module, second crawler processing module, third crawler processing module, and fourth crawler processing module are executed separately;
the first crawler processing module is configured to use a first crawler program to obtain, according to the sub-URL address of the company profile data item, the content pages corresponding to the company profile data item, to parse those content pages to obtain the company profile information of the first construction enterprise, and to store the company profile information in a construction enterprise information database; and
the second crawler processing module is configured to use a second crawler program to obtain, according to the sub-URL address of the honor information data item, the content pages corresponding to the honor information data item, to parse those content pages to obtain the honor information of the first construction enterprise, and to store the honor information in the construction enterprise information database; and
the third crawler processing module is configured to use a third crawler program to obtain, according to the sub-URL address of the bid-winning information data item, the content pages corresponding to the bid-winning information data item, to parse those content pages to obtain the bid-winning information of the first construction enterprise, and to store the bid-winning information in the construction enterprise information database; and
the fourth crawler processing module is configured to use a fourth crawler program to obtain, according to the sub-URL address of the operation information data item, the content pages corresponding to the operation information data item, to parse those content pages to obtain the operation information of the first construction enterprise, and to store the operation information in the construction enterprise information database;
the second sub-URL acquisition module is configured to obtain, according to the enterprise name of the first construction enterprise, the sub-URL address of the qualification grade data item and the sub-URL address of the structure art information data item of the first construction enterprise from the construction market supervision and integrity publishing platform;
after the second sub-URL acquisition module finishes executing, the following fifth crawler processing module and sixth crawler processing module are executed separately;
the fifth crawler processing module is configured to use a fifth crawler program to obtain, according to the sub-URL address of the qualification grade data item, the content pages corresponding to the qualification grade data item, to parse those content pages to obtain the qualification grade information of the first construction enterprise, and to store the qualification grade information in the construction enterprise information database; and
the sixth crawler processing module is configured to use a sixth crawler program to obtain, according to the sub-URL address of the structure art information data item, the content pages corresponding to the structure art information data item, to parse those content pages to obtain the structure art information of the first construction enterprise, and to store the structure art information in the construction enterprise information database.
By adopting the above technical solutions, the present invention has the following advantages:
In the embodiments of the present invention, information about construction enterprises can be crawled. For each construction enterprise, the following six data items can be crawled: company profile, bid-winning information, operation information, honor information, qualification grade, and structure art information. Each data item uses an independent crawler program, so the data items can be crawled in parallel. The sub-URL address of each data item can be obtained from the URL address and the enterprise name of the construction enterprise, so the embodiments of the present invention use a breadth-first search strategy, can crawl web page resources specifically for construction enterprises in this industry, and improve the speed at which those web page resources are crawled.
Brief description of the drawings
Fig. 1 is a schematic flow block diagram of a big data acquisition method for construction enterprises provided by an embodiment of the present invention;
Fig. 2 is a flowchart of crawling company profile, bid-winning information, operation information, and honor information data in an embodiment of the present invention;
Fig. 3 is a flowchart of crawling company qualification grade and structure art information in an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of the big data acquisition system for construction enterprises provided by an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention provide a big data acquisition method and system for construction enterprises, which provide a big data collection mechanism suited to the construction industry and collect the web page resources of construction enterprises.
To make the objects, features, and advantages of the present invention more apparent and understandable, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of the embodiments. Obviously, the embodiments described below are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention fall within the scope of protection of the present invention.
The terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not intended to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; this is merely the manner adopted in the embodiments of the present invention to distinguish objects having the same attribute. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device that comprises a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device.
Detailed descriptions are given below.
An embodiment of the big data acquisition method for construction enterprises of the present invention can be applied to crawling the web page resources of construction enterprises. Referring to Fig. 1, the big data acquisition method for construction enterprises provided by the present invention may include the following step 1 to step 11:
Step 1: obtaining, in turn, the uniform resource locator (Uniform/Universal Resource Locator, URL) address and the corresponding enterprise name of each of a plurality of construction enterprises according to the hierarchical regional location relationship of the construction enterprises.
The embodiments of the present invention are dedicated to data acquisition for construction enterprises. The hierarchical regional location relationship may cover the whole country: construction enterprises are crawled in turn by the province, city, county, and district in which they are located, with the aim of obtaining the URL (url) and company name (com_name) of every company in each province, city, county, and district. The number of construction enterprises obtained in the embodiments of the present invention may be determined according to the actual scenario.
Step 2: saving the URL addresses of the plurality of construction enterprises to a URL text file, and saving the enterprise names of the plurality of construction enterprises to a name text file.
In the embodiments of the present invention, the URL addresses and enterprise names of the plurality of construction enterprises obtained in step 1 are stored in the url text file and the com_name text file, respectively, so that they can be read later when the data of the construction enterprises are crawled.
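The saving and later reading of the two text files can be sketched as follows. This is only an illustrative sketch using the Python standard library: the file names url.txt and com_name.txt, the one-entry-per-line layout, the helper names, and the sample enterprises are all assumptions, not details given in this description.

```python
from pathlib import Path
import tempfile


def save_enterprises(enterprises, url_path, name_path):
    """Write one URL per line and one enterprise name per line, in matching order."""
    url_path.write_text(
        "".join(url + "\n" for url, _ in enterprises), encoding="utf-8")
    name_path.write_text(
        "".join(name + "\n" for _, name in enterprises), encoding="utf-8")


def read_enterprises(url_path, name_path):
    """Yield (url, name) pairs by reading both files line by line in step."""
    with url_path.open(encoding="utf-8") as urls, \
            name_path.open(encoding="utf-8") as names:
        for url, name in zip(urls, names):
            yield url.strip(), name.strip()


# Hypothetical sample data standing in for the result of step 1.
enterprises = [
    ("http://example.com/company/1", "Example Construction Co."),
    ("http://example.com/company/2", "Sample Building Group"),
]

workdir = Path(tempfile.mkdtemp())
url_file = workdir / "url.txt"            # assumed file name
name_file = workdir / "com_name.txt"      # assumed file name

save_enterprises(enterprises, url_file, name_file)
pairs = list(read_enterprises(url_file, name_file))
```

Keeping the two files in the same line order is what allows the URL and the name of the same enterprise to be paired again when they are read back.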
Step 3: reading the URL address of a first construction enterprise from the URL text file, and reading the enterprise name of the first construction enterprise from the name text file, wherein the first construction enterprise is any one of the plurality of construction enterprises.
In the embodiments of the present invention, the URL text file and the name text file hold the enterprise information of a plurality of construction enterprises. For ease of description, the big data crawling of one of those construction enterprises is described in detail below as an example. The first construction enterprise is defined as any one of the construction enterprises, so the big data crawling of each of the plurality of construction enterprises can be carried out in the same way as steps 3 to 11 are carried out for the first construction enterprise. The processing of the first construction enterprise is taken as the example below.
After step 3 is completed, step 4 and step 9 are performed separately.
Step 4: obtaining, according to the URL address of the first construction enterprise, the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the operation information data item of the first construction enterprise. After step 4 is completed, the following step 5, step 6, step 7, and step 8 are performed separately.
Step 5: using a first crawler program to obtain, according to the sub-URL address of the company profile data item, the content pages corresponding to the company profile data item, parsing those content pages to obtain the company profile information of the first construction enterprise, and storing the company profile information in a construction enterprise information database; and
Step 6: using a second crawler program to obtain, according to the sub-URL address of the honor information data item, the content pages corresponding to the honor information data item, parsing those content pages to obtain the honor information of the first construction enterprise, and storing the honor information in the construction enterprise information database; and
Step 7: using a third crawler program to obtain, according to the sub-URL address of the bid-winning information data item, the content pages corresponding to the bid-winning information data item, parsing those content pages to obtain the bid-winning information of the first construction enterprise, and storing the bid-winning information in the construction enterprise information database; and
Step 8: using a fourth crawler program to obtain, according to the sub-URL address of the operation information data item, the content pages corresponding to the operation information data item, parsing those content pages to obtain the operation information of the first construction enterprise, and storing the operation information in the construction enterprise information database.
The crawler programs in steps 5 to 8 are independent of one another, and each can be implemented with the scrapy framework. Each data item in steps 5 to 8 has its own independent crawler program, so crawling one data item does not affect crawling another, and crawling speed improves. Specifically, in step 5 the company profile data item may refer to the industrial and commercial registration information of the construction enterprise, which may include, for example, the enterprise name and unique identifier of the construction enterprise, the date of establishment, the registration authority, shareholder information, registered capital, and external investments. The honor information data item may refer to the honors obtained by the construction enterprise, such as awards won by its projects and awards won by the enterprise itself. The bid-winning information data item may refer to the bids won by the construction enterprise when contracting projects, and the operation information data item may refer to information on the business status of the construction enterprise.
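The independence of the four crawler programs can be illustrated with a small sketch. The description names the scrapy framework; the sketch below instead swaps in a thread pool from the Python standard library purely to show the four data items being crawled side by side. The function crawl_item, the dictionary keys, and the example.com addresses are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_item(item_name, sub_url):
    """Hypothetical stand-in for one independent crawler program.

    A real crawler would download and parse the content pages under sub_url.
    """
    return {"item": item_name, "source": sub_url}


# Assumed output of step 4: one sub-URL per data item.
sub_urls = {
    "company_profile": "http://example.com/company/1/profile",
    "honor": "http://example.com/company/1/honor",
    "bid_winning": "http://example.com/company/1/bids",
    "operation": "http://example.com/company/1/operation",
}


def crawl_all(sub_urls):
    """Run one crawler per data item concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=len(sub_urls)) as pool:
        futures = {
            name: pool.submit(crawl_item, name, url)
            for name, url in sub_urls.items()
        }
        return {name: future.result() for name, future in futures.items()}


results = crawl_all(sub_urls)
```

Because no crawler reads another crawler's output, a slow or failing data item does not hold up the others until the results are collected.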
Step 9: obtaining, according to the enterprise name of the first construction enterprise, the sub-URL address of the qualification grade data item and the sub-URL address of the structure art information data item of the first construction enterprise from the construction market supervision and integrity publishing platform.
The construction market supervision and integrity publishing platform stores the enterprise qualification records of many construction enterprises. If no enterprise qualification record for a construction enterprise can be found on the platform, that construction enterprise does not hold a construction enterprise qualification. For an enterprise that does hold a construction enterprise qualification, the qualification grade data and the structure art information data can be obtained from the platform.
After step 9 is completed, the following step 10 and step 11 are performed separately.
Step 10: using a fifth crawler program to obtain, according to the sub-URL address of the qualification grade data item, the content pages corresponding to the qualification grade data item, parsing those content pages to obtain the qualification grade information of the first construction enterprise, and storing the qualification grade information in the construction enterprise information database; and
Step 11: using a sixth crawler program to obtain, according to the sub-URL address of the structure art information data item, the content pages corresponding to the structure art information data item, parsing those content pages to obtain the structure art information of the first construction enterprise, and storing the structure art information in the construction enterprise information database.
In the embodiments of the present invention, the massive data of the Internet can be used to collect and mine the credit data of construction enterprises. The mined construction enterprise data can be used to generate a new enterprise credit evaluation model dedicated to the credit evaluation of construction enterprises. A crawler tool can be used to crawl the specific enterprise information of construction enterprises from big data. For each construction enterprise, the following six data items can be crawled: company profile, bid-winning information, operation information, honor information, qualification grade, and structure art information. For example, big data technology can also be used to crawl the industrial and commercial registration information, construction enterprise qualification grade standard information, honor information, and historical contract performance information of construction enterprises. In the embodiments of the present invention, the industrial and commercial registration information of a construction enterprise may include the enterprise name and unique identifier of the construction enterprise, the date of establishment, the registration authority, shareholder information, registered capital, and external investments. The honor information of a construction enterprise may include the project awards, enterprise awards, and development achievements obtained by the construction enterprise. The historical contract performance information includes the number of bids won by the construction enterprise and its default event information; the default event information includes the bad-conduct records of each construction enterprise and information on cases in which it was subject to court enforcement. In the embodiments of the present invention, each data item uses an independent crawler program, so the data items can be crawled in parallel. The sub-URL address of each data item can be obtained from the URL address and the enterprise name of the construction enterprise, so the embodiments of the present invention use a breadth-first search strategy, can crawl web page resources specifically for construction enterprises in this industry, and improve the speed at which those web page resources are crawled.
In some embodiments of the present invention, step 4 specifically includes the following steps:
obtaining the home page HyperText Markup Language (HTML) web page of the first construction enterprise according to the URL address of the first construction enterprise;
parsing the home page HTML web page and searching all hyperlinks of the home page HTML web page; after the traversal of the first-level pages is completed, searching the second-level pages, and continuing in this manner until the bottom-level pages have been searched; after the search is completed, outputting the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the operation information data item of the first construction enterprise.
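The level-by-level search described above can be sketched as a breadth-first traversal. The following is a minimal sketch rather than the implementation of the embodiment: it uses the standard-library html.parser and an in-memory dictionary in place of real HTTP requests, and the function find_sub_urls, the fetch callback, the link texts, and the example.com pages are all hypothetical. The visited set is an added assumption that keeps cross-references between levels from causing repeated visits.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkParser(HTMLParser):
    """Collect (href, link text) pairs from the <a> tags of one page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def find_sub_urls(home_url, fetch, wanted):
    """Search pages level by level and map each wanted link text to its URL."""
    found = {}
    visited = {home_url}
    queue = deque([home_url])          # FIFO queue gives breadth-first order
    while queue:
        url = queue.popleft()
        parser = LinkParser()
        parser.feed(fetch(url))
        parser.close()
        for href, text in parser.links:
            link = urljoin(url, href)
            if text in wanted and text not in found:
                found[text] = link
            if link not in visited:    # skip pages already queued or searched
                visited.add(link)
                queue.append(link)
    return found


# Hypothetical three-level site held in memory instead of being downloaded.
SITE = {
    "http://example.com/": '<a href="/about">About</a><a href="/news">News</a>',
    "http://example.com/about":
        '<a href="/about/profile">Company profile</a><a href="/">Home</a>',
    "http://example.com/news": '<a href="/news/honors">Honor information</a>',
}

sub_urls = find_sub_urls(
    "http://example.com/",
    lambda url: SITE.get(url, ""),
    {"Company profile", "Honor information"},
)
```

The "Home" link on the second-level page points back to the first level; because that address is already in the visited set, it is not queued again and the traversal terminates.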
For example, the home page HTML of the first construction enterprise is first parsed with BeautifulSoup to obtain the sub-URL of the data item to be crawled. The URL of each item of data content under that sub-URL (that is, the content URL) is then obtained in a loop, and BeautifulSoup is used to parse the HTML of each content page and extract the required information, which is stored in the database.
Further, in some embodiments of the present invention, obtaining the home page HTML web page of the first construction enterprise according to the URL address of the first construction enterprise includes the following steps:
when the URL address of the first construction enterprise is a login page URL address, entering a user name and password through a construction enterprise query client;
after login succeeds, determining whether a verification code appears on the home page HTML web page of the first construction enterprise;
if no verification code appears on the home page HTML web page, triggering the step of parsing the home page HTML web page; or
if a verification code appears on the home page HTML web page, recognizing the verification code image on the home page HTML web page to obtain the image information of the verification code image; submitting the image information to the server for verification; when the verification code recognition passes, triggering the step of parsing the home page HTML web page; and when the verification code recognition does not pass, recognizing the verification code image on the home page HTML web page again.
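The login and verification code flow above can be sketched as a small control loop. Everything named here is an assumption made for illustration: FakeSession stands in for a real authenticated HTTP session, recognize stands in for whatever image recognition is used, and the max_attempts limit is added so the sketch cannot loop forever, whereas the description itself only says that recognition is repeated on failure.

```python
class FakeSession:
    """Hypothetical stand-in for an HTTP session with a login page."""

    def __init__(self, answer):
        self.answer = answer        # None means no verification code appears
        self.logged_in = False
        self.attempts = 0

    def login(self, username, password):
        self.logged_in = True

    def has_captcha(self):
        return self.answer is not None

    def captcha_image(self):
        return b"fake-image-bytes"

    def submit_captcha(self, text):
        self.attempts += 1
        return text == self.answer

    def home_html(self):
        return "<html>home</html>"


def open_home_page(session, username, password, recognize, max_attempts=3):
    """Log in, pass the verification code if one appears, return the home page."""
    session.login(username, password)
    if not session.has_captcha():
        return session.home_html()            # go straight to parsing
    for _ in range(max_attempts):
        text = recognize(session.captcha_image())
        if session.submit_captcha(text):      # server-side verification
            return session.home_html()
        # Verification failed: recognize the image again on the next pass.
    raise RuntimeError("verification code was not recognized")


# First guess is wrong, second guess is right.
guesses = iter(["x9k2", "a7c4"])
session = FakeSession(answer="a7c4")
html = open_home_page(session, "user", "secret", lambda image: next(guesses))
```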
In some embodiments of the present invention, step 5 specifically includes the following steps:
using the first crawler program to obtain, in a loop, the URL address of each item of company profile data content under the sub-URL address of the company profile data item, parsing the HTML web page corresponding to the URL address of each item of company profile data content to obtain the company profile information of the first construction enterprise, and storing the company profile information in the construction enterprise information database.
It should be noted that in embodiments of the present invention, foregoing teachings are lifted to the specific implementation of step 5
Example explanation, wherein the specific implementation of step 6 to step 8 can also be similar with step 5, does not repeat herein.
In some embodiments of the invention, step 10 specifically comprises the following steps:
Judging whether a qualification grade data item corresponding to the enterprise name of the first enterprise
in charge of construction exists on the construction market supervision and sincere distribution platform;
If no qualification grade data item of the first enterprise in charge of construction exists on the
construction market supervision and sincere distribution platform, continuing to read the enterprise name of
the next enterprise in charge of construction from the name text file;
If a qualification grade data item of the first enterprise in charge of construction exists on the
construction market supervision and sincere distribution platform, getting the qualification grade content
page of the first enterprise in charge of construction using the fifth crawler program, parsing the
qualification grade content page, and extracting the company name, qualification category, qualification
certificate number, qualification name, date of issue, certificate validity period and issuing authority of
the first enterprise in charge of construction;
Storing the company name, qualification category, qualification certificate number, qualification name,
date of issue, certificate validity period and issuing authority of the first enterprise in charge of
construction, in that order, in the enterprise in charge of construction information database.
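As a rough illustration of the ordered storage step, the following sketch writes one qualification record into a database with the fields in the order listed above. It assumes SQLite purely for convenience (the text does not name a database engine); the table name, column names and sample values are hypothetical.

```python
import sqlite3

# Column order follows the storage order given in the text; the column
# names themselves are hypothetical.
FIELDS = (
    "company_name", "qualification_category", "certificate_number",
    "qualification_name", "issue_date", "valid_until", "issuing_authority",
)


def store_qualification(conn, record):
    """Insert one qualification record, keeping the fields in FIELDS order."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qualification ({})".format(", ".join(FIELDS))
    )
    placeholders = ", ".join("?" for _ in FIELDS)
    conn.execute(
        "INSERT INTO qualification VALUES ({})".format(placeholders),
        [record[field] for field in FIELDS],
    )
    conn.commit()


# Made-up sample record, for illustration only.
sample = {
    "company_name": "Example Construction Co.",
    "qualification_category": "Category A",
    "certificate_number": "CERT-0001",
    "qualification_name": "General contracting",
    "issue_date": "2017-01-01",
    "valid_until": "2022-01-01",
    "issuing_authority": "Example Authority",
}

conn = sqlite3.connect(":memory:")
store_qualification(conn, sample)
row = conn.execute("SELECT * FROM qualification").fetchone()
```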
In some embodiments of the invention, step 11 specifically comprises the following steps:
Judging whether a structure art information data item corresponding to the enterprise name of the first
enterprise in charge of construction exists on the construction market supervision and sincere distribution
platform;
If no structure art information data item of the first enterprise in charge of construction exists on the
construction market supervision and sincere distribution platform, continuing to read the enterprise name of
the next enterprise in charge of construction from the name text file;
If a structure art information data item of the first enterprise in charge of construction exists on the
construction market supervision and sincere distribution platform, getting the structure art content page of
the first enterprise in charge of construction using the sixth crawler program, parsing the structure art
content page, and extracting the structure art name, identification card number, registration category,
registration number and registered specialty of the first enterprise in charge of construction;
Storing the structure art name, identification card number, registration category, registration number and
registered specialty of the first enterprise in charge of construction, in that order, in the enterprise in
charge of construction information database.
In some embodiments of the invention, the first crawler program, the second crawler program, the third
crawler program, the fourth crawler program, the fifth crawler program and the sixth crawler program each
crawl their corresponding data items to be crawled simultaneously through the construction enterprise query
client.
The first to sixth crawler programs are crawler tools that execute independently, and the crawler tools can
run at the same time, which improves the crawling efficiency of the data items.
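The idea of six independent crawlers running at the same time can be sketched as below. Note the substitution: the embodiments use separate Scrapy crawler programs, whereas this sketch uses a standard-library thread pool simply to show the fan-out pattern. The make_crawler helper and its placeholder return value are assumptions, not part of the described method.

```python
from concurrent.futures import ThreadPoolExecutor

# The six data items named in the text, one independent crawler per item.
DATA_ITEMS = [
    "company introduction",
    "acceptance of the bid information",
    "operation information",
    "honor information",
    "qualification grade",
    "structure art information",
]


def make_crawler(item):
    """Build a placeholder crawler for one data item (hypothetical)."""
    def crawl(company):
        # A real crawler would fetch and parse pages here.
        return "{} of {}".format(item, company)
    return crawl


def crawl_all(company):
    """Run every data-item crawler concurrently and collect the results."""
    crawlers = {item: make_crawler(item) for item in DATA_ITEMS}
    with ThreadPoolExecutor(max_workers=len(crawlers)) as pool:
        futures = {
            item: pool.submit(crawl, company) for item, crawl in crawlers.items()
        }
        return {item: future.result() for item, future in futures.items()}


results = crawl_all("Example Construction Co.")
```

Because each crawler only touches its own data item, a failure or slowdown in one does not block the others, which is the independence property the text relies on.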
In some embodiments of the invention, the first crawler program, the second crawler program, the third
crawler program, the fourth crawler program, the fifth crawler program and the sixth crawler program each
use a separate Scrapy framework.
For the implementation of the Scrapy framework, reference may be made to the prior art.
To facilitate a better understanding and implementation of the above schemes of the embodiments of the
present invention, a corresponding application scenario is described in detail below as an example.
In the prior art, web crawlers mostly crawl the web side. For websites that require login with a user name
and password, however, collecting a large amount of data through the web side easily causes the account to
be blocked, so that big data acquisition fails. In the embodiments of the present invention, for websites
that require login, data is collected by crawling the mobile phone client, which effectively reduces the
risk of the account being blocked and thereby ensures that big data acquisition proceeds smoothly.
The embodiments of the present invention adopt a breadth-first search strategy: the loop runs from the top
layer to the bottom layer, all hyperlinks in the first-level page are searched first, and the search loop
over the second-level pages starts only after the first-level page has been traversed, and so on down to the
bottom layer. The embodiments use the Scrapy framework, with an independent crawler program for each data
item, so that the crawling of different data items does not interfere with one another and the crawling
speed is improved to some extent.
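A breadth-first traversal of this kind can be sketched with a queue. The link graph below is a made-up in-memory example, and breadth_first_crawl and get_links are hypothetical names; a real crawler would obtain the links of each page by fetching and parsing it.

```python
from collections import deque


def breadth_first_crawl(start, get_links):
    """Visit pages level by level, returning them in visiting order."""
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for link in get_links(page):
            if link not in seen:  # never queue the same page twice
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited


# Hypothetical site: a homepage, two first-level pages, three second-level pages.
LINKS = {
    "home": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1", "home"],
}

order = breadth_first_crawl("home", lambda page: LINKS.get(page, []))
```

All first-level pages ("a", "b") are visited before any second-level page, in contrast to a depth-first strategy, which would follow "a" down to "a1" and "a2" before reaching "b".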
Existing crawlers generally do not include a verification code recognition function. In the embodiments of
the present invention, a verification code recognition function is added to the program, so that the
program more closely simulates manual operation. This reduces the risk of triggering anti-crawling measures
and saves the time of manually entering verification codes, making the crawler program run more smoothly.
The embodiments of the present invention collect only data related to construction industry companies, and
provide a data acquisition method for capturing the relevant information of all construction industry
companies nationwide. Six data items are captured for each company: company introduction, acceptance of the
bid information, operation information, honor information, qualification grade, and structure art
information. Each data item corresponds to an independent crawler program, and each crawler program uses the
Scrapy framework. These six parts of information come from two websites: Jianshetong and the "four
databases, one platform". The company introduction, acceptance of the bid information, operation information
and honor information come from the Jianshetong mobile phone client; the qualification grade and structure
art information come from the "four databases, one platform".
I. Crawling of company introduction, acceptance of the bid information, operation information and honor
information.
1. Crawling company URLs.
Companies are captured in turn by province, city, county and district, in order to obtain the url and the
company name com_name of every company in each province, city, county and district. These are saved in the
corresponding url text file and com_name text file for later use when crawling company-related data.
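Saving the urls and company names to two parallel text files, and reading them back line by line, might look like the following sketch. The file names (url.txt, com_name.txt), the helper names and the sample companies are assumptions for illustration; the text only states that one file holds the urls and another holds the names.

```python
import os
import tempfile


def save_companies(companies, url_path, name_path):
    """Write one url per line and one company name per line, in matching order."""
    with open(url_path, "w", encoding="utf-8") as url_file, \
            open(name_path, "w", encoding="utf-8") as name_file:
        for url, com_name in companies:
            url_file.write(url + "\n")
            name_file.write(com_name + "\n")


def read_lines(path):
    """Yield the non-empty lines of a text file, one record at a time."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield line


# Hypothetical sample data; real urls and names would come from the regional crawl.
companies = [
    ("http://example.com/company/1", "Example Construction Co."),
    ("http://example.com/company/2", "Sample Building Group"),
]

with tempfile.TemporaryDirectory() as workdir:
    url_path = os.path.join(workdir, "url.txt")
    name_path = os.path.join(workdir, "com_name.txt")
    save_companies(companies, url_path, name_path)
    urls = list(read_lines(url_path))
    names = list(read_lines(name_path))
```

Keeping the two files in the same line order means that line n of the url file and line n of the name file always refer to the same company.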
2. Crawling the related data item content of each company.
a) Data items that can be obtained without login, including company introduction and honor information.
The program reads the company url text file line by line, reading the url of each company in turn. It first
uses BeautifulSoup (a Python library that can extract data from HTML or XML) to parse the company homepage
HTML and obtain the url of each data item to be crawled (the sub-url). It then loops to obtain the url of
each item of data content under the sub-url (the content url). Finally, it uses BeautifulSoup to parse the
HTML of the content page, extracts the required information and stores it in the database.
b) Data items that can be obtained only after login, including acceptance of the bid information and
operation information.
The login page url of the website is used as the entry point of the program. The program fills in the user
name and password to simulate manual login, and after the login succeeds it captures the required data in
the manner described in a). Unlike a), each time a url is obtained, and before the HTML is parsed, the
program judges whether a verification code has appeared. If no verification code has appeared, the HTML is
parsed directly after the url is obtained. If a verification code has appeared, the verification code
picture is recognized first to obtain the digits of the verification code, as follows:
1) parse the current page to obtain the url of the verification code picture, then request the picture and
open it;
2) convert the picture from the RGB color space to the HSL color space, and use the L component to turn the
color picture into a grayscale image;
3) denoise the image by direct thresholding with a threshold of 90 (a value obtained through repeated
experiments), setting pixels above the threshold to 1 and all others to 0;
4) use the image_to_string function in pytesser (a Python module for recognizing text in pictures) to
convert the characters in the picture into text, which gives the verification code digits.
The obtained verification code digits are submitted to the server. If the submitted verification code is
correct, parsing of the HTML continues; if it is incorrect, the program proceeds again as in the case where
a verification code has appeared. The flow chart of this part of the program is shown in Fig. 2.
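The grayscale conversion and thresholding steps can be sketched in pure Python as follows. The sketch assumes that the L component refers to the lightness channel of the HSL (HLS) color space and that the threshold of 90 is on a 0 to 255 scale; both are interpretations of the text rather than confirmed details. The image is represented as nested lists of (R, G, B) tuples, and the final text recognition step with pytesser is left out because it requires third-party software.

```python
import colorsys

THRESHOLD = 90  # value given in the text; the 0-255 scale is an assumption


def binarize(pixels, threshold=THRESHOLD):
    """Convert rows of (R, G, B) pixels to 0/1 values by lightness thresholding."""
    result = []
    for row in pixels:
        out_row = []
        for red, green, blue in row:
            # colorsys works on 0..1 floats and returns (hue, lightness, saturation).
            _, lightness, _ = colorsys.rgb_to_hls(red / 255, green / 255, blue / 255)
            gray = lightness * 255  # grayscale value taken from the L component
            out_row.append(1 if gray > threshold else 0)
        result.append(out_row)
    return result


# Hypothetical 2x2 picture: white, black, bright red, dark red.
image = [
    [(255, 255, 255), (0, 0, 0)],
    [(200, 30, 30), (100, 20, 20)],
]

binary = binarize(image)
```

With these sample pixels, the white and bright red pixels have lightness values of 255 and 115 and are set to 1, while the black and dark red pixels (0 and 60) fall below the threshold and are set to 0.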
II. Crawling of qualification grade and structure art information. The flow chart of this part of the
program is shown in Fig. 3.
1. Crawling the qualification grade.
This data item requires six sub data items to be crawled: qualification category, qualification certificate
number, qualification name, date of issue, certificate validity period, and issuing authority.
The com_name text file is read line by line, and companies are crawled one by one by company name.
a) First judge whether the company exists on the "four databases, one platform". If it does not, skip it and
continue with the next company; if it does, perform step b).
b) Judge whether the company has qualification grade sub data item information. If it does not, skip it and
continue with the next company; if it does, perform step c).
c) Parse the qualification grade page HTML, extract the six sub data items, and store them in the database
in the order: company name, qualification category, qualification certificate number, qualification name,
date of issue, certificate validity period, issuing authority.
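Steps a) to c) amount to a loop with two skip conditions. The sketch below illustrates that control flow only: the lookup callback, the in-memory PLATFORM dictionary, the field names and the sample values are all hypothetical, standing in for requests to the platform and parsing of its pages.

```python
# The six sub data items, in the storage order given in the text
# (the key names are hypothetical).
SUB_ITEMS = (
    "category", "certificate_number", "name",
    "issue_date", "valid_until", "issuing_authority",
)


def crawl_qualifications(name_lines, lookup):
    """Return one row per qualification, skipping companies as in steps a) and b)."""
    rows = []
    for line in name_lines:
        com_name = line.strip()
        if not com_name:
            continue
        company = lookup(com_name)
        if company is None:  # a) company is not on the platform
            continue
        items = company.get("qualifications", [])
        if not items:  # b) company has no qualification grade information
            continue
        for item in items:  # c) extract the six sub data items in order
            rows.append((com_name, *(item[field] for field in SUB_ITEMS)))
    return rows


# Made-up platform contents for illustration.
PLATFORM = {
    "Example Construction Co.": {
        "qualifications": [{
            "category": "Category A",
            "certificate_number": "CERT-0001",
            "name": "General contracting",
            "issue_date": "2017-01-01",
            "valid_until": "2022-01-01",
            "issuing_authority": "Example Authority",
        }],
    },
    "Sample Building Group": {"qualifications": []},
}

name_lines = ["Example Construction Co.\n", "Unknown Co.\n", "Sample Building Group\n"]
rows = crawl_qualifications(name_lines, PLATFORM.get)
```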
2. Crawling structure art information.
This data item requires five sub data items to be crawled: structure art name, identification card number,
registration category, registration number (practice seal number), and registered specialty.
The data crawling flow is similar to that for the qualification grade; refer to the example given above.
For websites that require login with a user name and password, the present invention collects data through
the APP client. It uses the Scrapy crawler framework, with an independent crawler program for each data
item, and adopts a breadth-first search strategy. The verification code recognition part lets the program
better simulate manual operation, which reduces the risk of triggering anti-crawling measures and also
improves the crawling speed to some extent.
Referring to Fig. 4, an embodiment of the present application further provides a big data acquisition system
400 for enterprises in charge of construction, comprising:
A company information acquisition module 401, configured to obtain in turn, according to the hierarchical
regional position relationship of the enterprises in charge of construction, the respective uniform resource
locator (URL) addresses and corresponding enterprise names of a plurality of enterprises in charge of
construction;
A file storage module 402, configured to save the respective URL addresses of the plurality of enterprises
in charge of construction into a URL text file, and to save the respective enterprise names of the plurality
of enterprises in charge of construction into a name text file;
A file read module 403, configured to read the URL address of a first enterprise in charge of construction
from the URL text file, and to read the enterprise name of the first enterprise in charge of construction
from the name text file, wherein the first enterprise in charge of construction is any one of the plurality
of enterprises in charge of construction;
After the file read module 403 finishes executing, a first sub-URL acquisition module 404 and a second
sub-URL acquisition module 409 are executed respectively, wherein,
The first sub-URL acquisition module 404 is configured to get, according to the URL address of the first
enterprise in charge of construction, the sub-URL address of the company introduction data item, the sub-URL
address of the honor information data item, the sub-URL address of the acceptance of the bid information
data item and the sub-URL address of the operation information data item of the first enterprise in charge
of construction; after the first sub-URL acquisition module finishes executing, the following first crawler
processing module, second crawler processing module, third crawler processing module and fourth crawler
processing module are executed respectively;
A first crawler processing module 405, configured to use the first crawler program to get the content page
corresponding to the company introduction data item according to the sub-URL address of the company
introduction data item, parse that content page to obtain the company introduction information of the first
enterprise in charge of construction, and store the company introduction information in the enterprise in
charge of construction information database; and
A second crawler processing module 406, configured to use the second crawler program to get the content page
corresponding to the honor information data item according to the sub-URL address of the honor information
data item, parse that content page to obtain the honor information of the first enterprise in charge of
construction, and store the honor information in the enterprise in charge of construction information
database; and
A third crawler processing module 407, configured to use the third crawler program to get the content page
corresponding to the acceptance of the bid information data item according to the sub-URL address of the
acceptance of the bid information data item, parse that content page to obtain the acceptance of the bid
information of the first enterprise in charge of construction, and store the acceptance of the bid
information in the enterprise in charge of construction information database; and
A fourth crawler processing module 408, configured to use the fourth crawler program to get the content page
corresponding to the operation information data item according to the sub-URL address of the operation
information data item, parse that content page to obtain the operation information of the first enterprise
in charge of construction, and store the operation information in the enterprise in charge of construction
information database;
The second sub-URL acquisition module 409 is configured to get, according to the enterprise name of the
first enterprise in charge of construction, the sub-URL address of the qualification grade data item and the
sub-URL address of the structure art information data item of the first enterprise in charge of construction
from the construction market supervision and sincere distribution platform;
After the second sub-URL acquisition module 409 finishes executing, the following fifth crawler processing
module 410 and sixth crawler processing module 411 are executed respectively;
A fifth crawler processing module 410, configured to use the fifth crawler program to get the content page
corresponding to the qualification grade data item according to the sub-URL address of the qualification
grade data item, parse that content page to obtain the qualification grade information of the first
enterprise in charge of construction, and store the qualification grade information in the enterprise in
charge of construction information database; and
A sixth crawler processing module 411, configured to use the sixth crawler program to get the content page
corresponding to the structure art information data item according to the sub-URL address of the structure
art information data item, parse that content page to obtain the structure art information of the first
enterprise in charge of construction, and store the structure art information in the enterprise in charge of
construction information database.
It should also be noted that the device embodiments described above are merely schematic. The units
described as separate components may or may not be physically separate, and the components shown as units
may or may not be physical units; that is, they may be located in one place or distributed over multiple
network units. Some or all of the modules may be selected according to actual needs to achieve the purpose
of the scheme of this embodiment. In addition, in the drawings of the device embodiments provided by the
present invention, a connection between modules indicates that there is a communication connection between
them, which may be implemented as one or more communication buses or signal lines. A person of ordinary
skill in the art can understand and implement this without creative effort.
From the above description of the embodiments, a person skilled in the art can clearly understand that the
present invention may be implemented by software plus the necessary general-purpose hardware, and of course
may also be implemented by dedicated hardware, including application-specific integrated circuits, dedicated
CPUs, dedicated memories, dedicated components and the like. In general, any function performed by a
computer program can easily be implemented with corresponding hardware, and the specific hardware structure
used to implement the same function can take many forms, such as analog circuits, digital circuits or
dedicated circuits. For the present invention, however, a software implementation is in most cases the
better embodiment. Based on this understanding, the technical solution of the present invention, in essence
or in the part that contributes to the prior art, may be embodied in the form of a software product. The
computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash
drive, removable hard disk, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random
Access Memory), magnetic disk or optical disc, and includes several instructions for causing a computer
device (which may be a personal computer, a server, a network device or the like) to perform the methods
described in the embodiments of the present invention.
In summary, the above embodiments are merely intended to illustrate the technical solutions of the present
invention, not to limit them. Although the present invention has been described in detail with reference to
the above embodiments, a person of ordinary skill in the art should understand that the technical solutions
described in the above embodiments may still be modified, or some of their technical features may be
replaced by equivalents, and that such modifications or replacements do not cause the essence of the
corresponding technical solutions to depart from the spirit and scope of the technical solutions of the
embodiments of the present invention.