CN107590236A - Big data acquisition method and system for building construction enterprises - Google Patents

Publication number
CN107590236A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710809082.8A
Other languages
Chinese (zh)
Other versions
CN107590236B (en)
Inventor
张子柯
王朝
毛江群
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Number Cube Credit Co Ltd
Original Assignee
Hangzhou Number Cube Credit Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Number Cube Credit Co Ltd
Priority to CN201710809082.8A
Publication of CN107590236A
Application granted
Publication of CN107590236B
Legal status: Active
Anticipated expiration

Abstract

The invention discloses a big data acquisition method and system for building construction enterprises, which provide a big data collection mechanism suited to the construction industry and collect web page resources about building construction enterprises. In embodiments of the invention, information about building construction enterprises can be crawled. For each enterprise, the following six data items can be crawled: company profile, bid-winning information, operation information, honor information, qualification grade, and constructor information. Each data item uses an independent crawler program, so the data items can be crawled in parallel. The sub-URL address of each data item can be obtained from the enterprise's URL address and enterprise name, so the embodiments use a breadth-first search strategy. This allows targeted crawling of web page resources about building construction enterprises within the industry and increases the speed at which those resources are crawled.

Description

Big data acquisition method and system for building construction enterprises
Technical field
The present invention relates to the field of data acquisition technology for the construction industry, and in particular to a big data acquisition method and system for building construction enterprises.
Background
With the rapid development of the network (web), the World Wide Web has become the carrier of a vast amount of information, and extracting and using this information efficiently has become a major challenge. Search engines, such as the traditional general-purpose search engines Yahoo! and Google, serve as tools that help people retrieve information and have become the entry point and guide for users accessing the web. However, these general-purpose search engines have limitations: they cannot crawl targeted web page resources.
To solve this problem, web crawlers emerged. A web crawler can crawl relevant web page resources in a targeted way. It is a program that downloads web pages automatically; according to a given crawl target, it selectively visits pages and related links on the World Wide Web to obtain the information the user needs.
In the prior art, web crawlers mostly obtain information by crawling web pages, and mostly use a depth-first search strategy. That is, within a HyperText Markup Language (HTML) file, one hyperlink tag is selected and searched in depth. When the traversal reaches the bottom of that hyperlink chain, a logical test determines that the search of this layer has ended; the crawler then exits this layer's loop, returns to the upper-layer loop, and starts searching the other hyperlink tags, until all hyperlinks in the initial file have been traversed. An Internet Protocol (IP) proxy can be set during crawling to get around anti-crawler measures.
Because prior-art web crawlers mostly use a depth-first search strategy, problems arise as data structures grow more complex: the vertical depth of a website can grow without bound and different levels can cross-reference one another, so an infinite loop can occur and the traversal can only be ended by forcibly stopping the program. The information obtained contains much repetition and redundancy, so its quality is hard to guarantee. If the program is interrupted during execution by a network failure or another cause, the interruption point is hard to locate, so the point from which to restart the crawler cannot be determined. Crawling of web page resources in the prior art is therefore inefficient.
In addition, prior-art web crawlers are not designed specifically for crawling web pages of the construction industry and do not take the characteristics of that industry into account. A web page resource crawling scheme suited to the construction industry is therefore urgently needed.
Summary of the invention
The object of the present invention is to provide a big data acquisition method and system for building construction enterprises, which provide a big data collection mechanism suited to the construction industry and collect web page resources about building construction enterprises.
To achieve the above object, the present invention adopts the following technical solutions:
In one aspect, the present invention provides a big data acquisition method for building construction enterprises, comprising:
Step 1: successively obtaining, according to the hierarchical regional location relationship of where the building construction enterprises are located, the uniform resource locator (URL) address and corresponding enterprise name of each of a plurality of building construction enterprises;
Step 2: saving the respective URL addresses of the plurality of building construction enterprises to a URL text file, and saving the respective enterprise names of the plurality of building construction enterprises to a name text file;
Step 3: reading the URL address of a first building construction enterprise from the URL text file, and reading the enterprise name of the first building construction enterprise from the name text file, wherein the first building construction enterprise is any one of the plurality of building construction enterprises;
After step 3 is completed, step 4 and step 9 are performed separately;
Step 4: obtaining, according to the URL address of the first building construction enterprise, the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the operation information data item of the first building construction enterprise; after step 4 is completed, the following steps 5, 6, 7 and 8 are performed separately;
Step 5: using a first crawler program to obtain, according to the sub-URL address of the company profile data item, the content page corresponding to the company profile data item, parsing that content page to obtain the company profile information of the first building construction enterprise, and storing the company profile information in a building construction enterprise information database; and
Step 6: using a second crawler program to obtain, according to the sub-URL address of the honor information data item, the content page corresponding to the honor information data item, parsing that content page to obtain the honor information of the first building construction enterprise, and storing the honor information in the building construction enterprise information database; and
Step 7: using a third crawler program to obtain, according to the sub-URL address of the bid-winning information data item, the content page corresponding to the bid-winning information data item, parsing that content page to obtain the bid-winning information of the first building construction enterprise, and storing the bid-winning information in the building construction enterprise information database; and
Step 8: using a fourth crawler program to obtain, according to the sub-URL address of the operation information data item, the content page corresponding to the operation information data item, parsing that content page to obtain the operation information of the first building construction enterprise, and storing the operation information in the building construction enterprise information database;
Step 9: obtaining, according to the enterprise name of the first building construction enterprise, from the construction market supervision and integrity publishing platform, the sub-URL address of the qualification grade data item and the sub-URL address of the constructor information data item of the first building construction enterprise;
After step 9 is completed, the following steps 10 and 11 are performed separately;
Step 10: using a fifth crawler program to obtain, according to the sub-URL address of the qualification grade data item, the content page corresponding to the qualification grade data item, parsing that content page to obtain the qualification grade information of the first building construction enterprise, and storing the qualification grade information in the building construction enterprise information database; and
Step 11: using a sixth crawler program to obtain, according to the sub-URL address of the constructor information data item, the content page corresponding to the constructor information data item, parsing that content page to obtain the constructor information of the first building construction enterprise, and storing the constructor information in the building construction enterprise information database.
In another aspect, the present invention provides a big data acquisition system for building construction enterprises, comprising:
An enterprise information acquisition module, configured to successively obtain, according to the hierarchical regional location relationship of where the building construction enterprises are located, the uniform resource locator (URL) address and corresponding enterprise name of each of a plurality of building construction enterprises;
A file storage module, configured to save the respective URL addresses of the plurality of building construction enterprises to a URL text file, and to save the respective enterprise names of the plurality of building construction enterprises to a name text file;
A file reading module, configured to read the URL address of a first building construction enterprise from the URL text file, and to read the enterprise name of the first building construction enterprise from the name text file, wherein the first building construction enterprise is any one of the plurality of building construction enterprises;
After the file reading module finishes, a first sub-URL acquisition module and a second sub-URL acquisition module are executed separately, wherein:
The first sub-URL acquisition module is configured to obtain, according to the URL address of the first building construction enterprise, the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the operation information data item of the first building construction enterprise; after the first sub-URL acquisition module finishes, the following first, second, third and fourth crawler processing modules are executed separately;
A first crawler processing module, configured to use a first crawler program to obtain, according to the sub-URL address of the company profile data item, the content page corresponding to the company profile data item, to parse that content page to obtain the company profile information of the first building construction enterprise, and to store the company profile information in a building construction enterprise information database; and
A second crawler processing module, configured to use a second crawler program to obtain, according to the sub-URL address of the honor information data item, the content page corresponding to the honor information data item, to parse that content page to obtain the honor information of the first building construction enterprise, and to store the honor information in the building construction enterprise information database; and
A third crawler processing module, configured to use a third crawler program to obtain, according to the sub-URL address of the bid-winning information data item, the content page corresponding to the bid-winning information data item, to parse that content page to obtain the bid-winning information of the first building construction enterprise, and to store the bid-winning information in the building construction enterprise information database; and
A fourth crawler processing module, configured to use a fourth crawler program to obtain, according to the sub-URL address of the operation information data item, the content page corresponding to the operation information data item, to parse that content page to obtain the operation information of the first building construction enterprise, and to store the operation information in the building construction enterprise information database;
The second sub-URL acquisition module is configured to obtain, according to the enterprise name of the first building construction enterprise, from the construction market supervision and integrity publishing platform, the sub-URL address of the qualification grade data item and the sub-URL address of the constructor information data item of the first building construction enterprise;
After the second sub-URL acquisition module finishes, the following fifth and sixth crawler processing modules are executed separately;
A fifth crawler processing module, configured to use a fifth crawler program to obtain, according to the sub-URL address of the qualification grade data item, the content page corresponding to the qualification grade data item, to parse that content page to obtain the qualification grade information of the first building construction enterprise, and to store the qualification grade information in the building construction enterprise information database; and
A sixth crawler processing module, configured to use a sixth crawler program to obtain, according to the sub-URL address of the constructor information data item, the content page corresponding to the constructor information data item, to parse that content page to obtain the constructor information of the first building construction enterprise, and to store the constructor information in the building construction enterprise information database.
With the above technical solutions, the present invention has the following advantages:
In embodiments of the invention, information about building construction enterprises can be crawled. For each enterprise, the following six data items can be crawled: company profile, bid-winning information, operation information, honor information, qualification grade, and constructor information. Each data item uses an independent crawler program, so the data items can be crawled in parallel. The sub-URL address of each data item can be obtained from the enterprise's URL address and enterprise name, so the embodiments use a breadth-first search strategy. This allows targeted crawling of web page resources about building construction enterprises within the industry and increases the speed at which those resources are crawled.
Brief description of the drawings
Fig. 1 is a schematic flow block diagram of a big data acquisition method for building construction enterprises provided by an embodiment of the invention;
Fig. 2 is a flowchart of crawling company profile, bid-winning information, operation information and honor information data in an embodiment of the invention;
Fig. 3 is a flowchart of crawling company qualification grade and constructor information in an embodiment of the invention;
Fig. 4 is a schematic structural diagram of the big data acquisition system for building construction enterprises provided by an embodiment of the invention.
Detailed description
Embodiments of the invention provide a big data acquisition method and system for building construction enterprises, which provide a big data collection mechanism suited to the construction industry and collect web page resources about building construction enterprises.
To make the objects, features and advantages of the invention clearer and easier to understand, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the embodiments described below are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention fall within the scope of protection of the invention.
The terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It should be understood that terms used in this way are interchangeable where appropriate; they are merely the way objects with the same attribute are distinguished when describing the embodiments. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion, so that a process, method, system, product or device comprising a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to such a process, method, product or device.
Each is described in detail below.
One embodiment of the big data acquisition method for building construction enterprises of the invention can be applied to crawling web page resources about building construction enterprises. Referring to Fig. 1, the method provided by the invention may include the following steps 1 to 11:
Step 1: successively obtaining, according to the hierarchical regional location relationship of where the building construction enterprises are located, the uniform resource locator (Uniform/Universal Resource Locator, URL) address and corresponding enterprise name of each of a plurality of building construction enterprises.
Embodiments of the invention are dedicated to data acquisition for building construction enterprises. The hierarchical regional location relationship may cover the whole country: enterprises are crawled successively by the province, city, county and district in which they are located, the aim being to obtain the URL (url) and company name (com_name) of every company in each province, city, county and district. The number of building construction enterprises obtained in an embodiment can be determined according to the actual scenario.
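The region-by-region traversal in step 1 can be sketched as a nested walk over the administrative hierarchy. This is a minimal illustration, not the patent's implementation: the `REGION_TREE` dictionary, the enterprise names and the `example.com` URLs are all hypothetical stand-ins for the listing pages a real crawler would fetch.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical in-memory stand-in for the region listing pages:
# province -> city -> county/district -> list of (com_name, url).
REGION_TREE: Dict[str, Dict[str, Dict[str, List[Tuple[str, str]]]]] = {
    "Zhejiang": {
        "Hangzhou": {
            "Xihu": [("Example Construction Co A", "http://example.com/a")],
            "Binjiang": [("Example Construction Co B", "http://example.com/b")],
        },
    },
    "Jiangsu": {
        "Nanjing": {
            "Gulou": [("Example Construction Co C", "http://example.com/c")],
        },
    },
}


def iter_enterprises(tree) -> Iterator[Tuple[str, str]]:
    """Walk province -> city -> county in order, yielding (com_name, url)."""
    for province in tree.values():
        for city in province.values():
            for enterprises in city.values():
                yield from enterprises


enterprises = list(iter_enterprises(REGION_TREE))
for com_name, url in enterprises:
    print(com_name, url)
```

Walking the hierarchy in a fixed order gives every enterprise a stable position in the output, which is what lets the URL list and the name list be kept in step with each other later.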
Step 2: saving the respective URL addresses of the plurality of building construction enterprises to a URL text file, and saving the respective enterprise names of the plurality of building construction enterprises to a name text file.
In embodiments of the invention, once step 1 has obtained the respective URL addresses and enterprise names of the plurality of building construction enterprises, they are stored in the url text file and the com_name text file respectively, to be read later when crawling data about the enterprises.
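A minimal sketch of the two text files follows, assuming one entry per line and that line N of the url file corresponds to line N of the com_name file. The file names `url.txt` and `com_name.txt`, the one-entry-per-line layout and the sample data are assumptions; the text only says that a url text file and a com_name text file are used.

```python
import tempfile
from pathlib import Path
from typing import List, Tuple


def save_enterprises(pairs: List[Tuple[str, str]], url_path: Path, name_path: Path) -> None:
    """Write URLs and names to two parallel text files, one entry per line."""
    url_path.write_text("".join(f"{url}\n" for _, url in pairs), encoding="utf-8")
    name_path.write_text("".join(f"{name}\n" for name, _ in pairs), encoding="utf-8")


def load_enterprises(url_path: Path, name_path: Path) -> List[Tuple[str, str]]:
    """Read the two files back and pair them up again by line position."""
    urls = url_path.read_text(encoding="utf-8").splitlines()
    names = name_path.read_text(encoding="utf-8").splitlines()
    return list(zip(names, urls))


# Hypothetical sample data.
pairs = [
    ("Example Construction Co A", "http://example.com/a"),
    ("Example Construction Co B", "http://example.com/b"),
]
workdir = Path(tempfile.mkdtemp())
save_enterprises(pairs, workdir / "url.txt", workdir / "com_name.txt")
loaded = load_enterprises(workdir / "url.txt", workdir / "com_name.txt")
```

Persisting the list before crawling means a later run can resume from a known line number instead of repeating the region traversal.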
Step 3: reading the URL address of a first building construction enterprise from the URL text file, and reading the enterprise name of the first building construction enterprise from the name text file, wherein the first building construction enterprise is any one of the plurality of building construction enterprises.
In embodiments of the invention, the URL text file and the name text file hold the enterprise information of a plurality of building construction enterprises. For ease of description, the big data crawling of one enterprise among them is described in detail as an example. For instance, the first building construction enterprise is defined as any one of the enterprises; big data crawling for the plurality of enterprises can then be carried out in the same way as steps 3 to 11 are carried out for the first building construction enterprise. The processing of the first building construction enterprise is taken as the example below.
After step 3 is completed, step 4 and step 9 are performed separately.
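One possible reading of this branch point is that the URL-driven branch (steps 4 to 8) and the name-driven branch (steps 9 to 11) do not depend on each other and could therefore run concurrently. The sketch below illustrates that reading with the standard-library thread pool; the functions `branch_by_url` and `branch_by_name` are hypothetical placeholders, and the text itself does not state that the two branches must be concurrent.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def branch_by_url(url: str) -> List[str]:
    """Placeholder for steps 4-8: data items reached from the enterprise URL."""
    return ["company_profile", "honor", "bid_winning", "operation"]


def branch_by_name(com_name: str) -> List[str]:
    """Placeholder for steps 9-11: data items looked up by enterprise name."""
    return ["qualification_grade", "constructor"]


def process_enterprise(com_name: str, url: str) -> Dict[str, List[str]]:
    """Start both branches for one enterprise and wait for both to finish."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        by_url = pool.submit(branch_by_url, url)
        by_name = pool.submit(branch_by_name, com_name)
        return {"by_url": by_url.result(), "by_name": by_name.result()}


result = process_enterprise("Example Construction Co A", "http://example.com/a")
```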
Step 4: obtaining, according to the URL address of the first building construction enterprise, the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the operation information data item of the first building construction enterprise; after step 4 is completed, the following steps 5, 6, 7 and 8 are performed separately.
Step 5: using a first crawler program to obtain, according to the sub-URL address of the company profile data item, the content page corresponding to the company profile data item, parsing that content page to obtain the company profile information of the first building construction enterprise, and storing the company profile information in a building construction enterprise information database; and
Step 6: using a second crawler program to obtain, according to the sub-URL address of the honor information data item, the content page corresponding to the honor information data item, parsing that content page to obtain the honor information of the first building construction enterprise, and storing the honor information in the building construction enterprise information database; and
Step 7: using a third crawler program to obtain, according to the sub-URL address of the bid-winning information data item, the content page corresponding to the bid-winning information data item, parsing that content page to obtain the bid-winning information of the first building construction enterprise, and storing the bid-winning information in the building construction enterprise information database; and
Step 8: using a fourth crawler program to obtain, according to the sub-URL address of the operation information data item, the content page corresponding to the operation information data item, parsing that content page to obtain the operation information of the first building construction enterprise, and storing the operation information in the building construction enterprise information database.
The crawler programs in steps 5 to 8 are independent of one another and can each be implemented with the scrapy framework. Because each data item in steps 5 to 8 has its own independent crawler program, crawling of one data item does not affect the others, and crawling speed improves to some extent. Specifically, in step 5, the company profile data item may refer to the industrial and commercial registration information of the building construction enterprise, which may include, for example: the enterprise name and unique identifier, the date of establishment, the registration authority, shareholder information, registered capital, and external investments. The honor information data item may refer to the honors the enterprise has received, such as its engineering awards and enterprise awards. The bid-winning information data item may refer to the enterprise's bid-winning record in contracting for projects, and the operation information data item may refer to information on the enterprise's operating status.
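The independence of the per-item crawlers can be illustrated with a small sketch in which each data item has its own function and a failure in one does not stop the others. Plain functions and a standard-library thread pool are used here in place of scrapy spiders so that the example stays self-contained; the function names, the `CRAWLERS` registry and the simulated failure are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def crawl_profile(sub_url: str) -> str:
    return f"profile from {sub_url}"


def crawl_honor(sub_url: str) -> str:
    return f"honor from {sub_url}"


def crawl_bids(sub_url: str) -> str:
    # Simulated failure to show that the other crawlers are unaffected.
    raise RuntimeError("simulated network failure")


def crawl_operation(sub_url: str) -> str:
    return f"operation from {sub_url}"


# One independent crawler per data item (hypothetical registry).
CRAWLERS: Dict[str, Callable[[str], str]] = {
    "company_profile": crawl_profile,
    "honor": crawl_honor,
    "bid_winning": crawl_bids,
    "operation": crawl_operation,
}


def run_crawlers(sub_urls: Dict[str, str]) -> Dict[str, str]:
    """Run every data item's crawler in its own worker and collect the results."""
    results: Dict[str, str] = {}
    with ThreadPoolExecutor(max_workers=len(CRAWLERS)) as pool:
        futures = {item: pool.submit(CRAWLERS[item], url) for item, url in sub_urls.items()}
        for item, future in futures.items():
            try:
                results[item] = future.result()
            except Exception as exc:  # isolate the failure to this data item
                results[item] = f"failed: {exc}"
    return results


results = run_crawlers({item: f"http://example.com/a/{item}" for item in CRAWLERS})
```

Keeping one crawler per data item also makes it easy to rerun only the item that failed, rather than the whole enterprise.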
Step 9: obtaining, according to the enterprise name of the first building construction enterprise, from the construction market supervision and integrity publishing platform, the sub-URL address of the qualification grade data item and the sub-URL address of the constructor information data item of the first building construction enterprise.
The construction market supervision and integrity publishing platform stores the enterprise qualification status of many building construction enterprises. If an enterprise's qualification status cannot be found on the platform, the enterprise does not hold a construction enterprise qualification. For an enterprise that does hold one, the qualification grade data and constructor information data can also be obtained from the platform.
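The name-based lookup and the "not found means no qualification" rule can be sketched as follows. The `PLATFORM_INDEX` dictionary is a hypothetical stand-in for a query against the platform, and the `platform.example` URLs are invented for illustration; the real platform's query interface is not described in the text.

```python
from typing import Dict, Optional

# Hypothetical stand-in for the platform's search results, keyed by enterprise name.
PLATFORM_INDEX: Dict[str, Dict[str, str]] = {
    "Example Construction Co A": {
        "qualification_grade": "http://platform.example/a/qualification",
        "constructor": "http://platform.example/a/constructors",
    },
}


def lookup_sub_urls(com_name: str) -> Optional[Dict[str, str]]:
    """Return the two sub-URLs for an enterprise, or None if it is not listed."""
    return PLATFORM_INDEX.get(com_name)


def has_qualification(com_name: str) -> bool:
    """An enterprise that cannot be found is treated as holding no qualification."""
    return lookup_sub_urls(com_name) is not None


found = lookup_sub_urls("Example Construction Co A")
missing = lookup_sub_urls("Unlisted Co")
```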
After step 9 is completed, the following steps 10 and 11 are performed separately.
Step 10: using a fifth crawler program to obtain, according to the sub-URL address of the qualification grade data item, the content page corresponding to the qualification grade data item, parsing that content page to obtain the qualification grade information of the first building construction enterprise, and storing the qualification grade information in the building construction enterprise information database; and
Step 11: using a sixth crawler program to obtain, according to the sub-URL address of the constructor information data item, the content page corresponding to the constructor information data item, parsing that content page to obtain the constructor information of the first building construction enterprise, and storing the constructor information in the building construction enterprise information database.
In embodiments of the invention, the massive data of the Internet can be used to collect and mine credit data about building construction enterprises. The mined data can be used to generate a new enterprise credit evaluation model dedicated to the credit evaluation of building construction enterprises. A crawler tool can be used to crawl specific enterprise information from big data; for each enterprise the following six data items can be crawled: company profile, bid-winning information, operation information, honor information, qualification grade, and constructor information. For example, big data techniques can also be used to crawl an enterprise's industrial and commercial registration information, construction enterprise qualification grade standard information, honor information, and historical contract-performance information. In embodiments of the invention, the industrial and commercial registration information of a building construction enterprise may include: the enterprise name and unique identifier, the date of establishment, the registration authority, shareholder information, registered capital, and external investments. The honor information of a building construction enterprise may include: the engineering awards, enterprise awards, and research and development achievements the enterprise has obtained. The historical contract-performance information includes: the enterprise's bid-winning count information and default event information, where the default event information includes each enterprise's bad behavior record information and court enforcement case information. In embodiments of the invention each data item uses an independent crawler program, so the data items can be crawled in parallel. The sub-URL address of each data item can be obtained from the enterprise's URL address and enterprise name, so the embodiments use a breadth-first search strategy. This allows targeted crawling of web page resources about building construction enterprises within the industry and increases the speed at which those resources are crawled.
In some embodiments of the invention, step 4 specifically comprises the following steps:
Obtaining the home page HyperText Markup Language (HTML) web page of the first construction enterprise according to the URL address of the first construction enterprise;
Parsing the home page HTML web page and searching all hyperlinks on it; after the traversal of the first-level pages is complete, starting the search of the second-level pages, and continuing in this loop until the bottom-level pages have been searched; after the search is complete, outputting the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the business information data item of the first construction enterprise.
For example, BeautifulSoup is first used to parse the home page HTML of the first construction enterprise and obtain the sub-URL of each data item to be crawled. The program then loops to obtain the URL of each piece of data content under that sub-URL (the content URL), parses the HTML of each content page with BeautifulSoup, and stores the extracted information in the database.
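The level-by-level traversal described above can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: it swaps in the standard-library `html.parser` for BeautifulSoup so that it has no dependencies, and the names `LinkCollector`, `breadth_first_urls`, `fake_fetch`, the `max_depth` limit, and the `example.test` pages are all assumptions made for the example.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def breadth_first_urls(home_url, fetch, max_depth=3):
    """Visit pages level by level and return URLs in the order first seen."""
    seen = {home_url}
    order = [home_url]
    queue = deque([(home_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if depth >= max_depth:  # hypothetical cut-off for the bottom level
            continue
        parser = LinkCollector()
        parser.feed(fetch(url))
        for href in parser.links:
            child = urljoin(url, href)
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append((child, depth + 1))
    return order


# Hypothetical in-memory site standing in for real HTTP responses.
PAGES = {
    "http://example.test/": '<a href="/intro">Intro</a><a href="/honor">Honor</a>',
    "http://example.test/intro": '<a href="/intro/1">Item</a>',
    "http://example.test/honor": '<a href="/honor/1">Item</a>',
}


def fake_fetch(url):
    return PAGES.get(url, "")


print(breadth_first_urls("http://example.test/", fake_fetch))
```

Because a queue is used, every first-level link is recorded before any second-level link, which is the ordering the breadth-first strategy relies on.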
Further, in some embodiments of the invention, obtaining the home page HTML web page of the first construction enterprise according to the URL address of the first construction enterprise comprises the following steps:
When the URL address of the first construction enterprise is a login page URL address, entering the user name and password through the construction enterprise query client;
After the login succeeds, determining whether a verification code appears on the home page HTML web page of the first construction enterprise;
If no verification code appears on the home page HTML web page, triggering the step of parsing the home page HTML web page; or,
If a verification code appears on the home page HTML web page, recognizing the verification code picture on the home page HTML web page to obtain the picture information on the verification code picture; submitting the picture information to the server for verification; when the verification code is accepted, triggering the step of parsing the home page HTML web page; when the verification code is not accepted, recognizing the verification code picture on the home page HTML web page again.
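The login and verification code branch above amounts to a retry loop, which can be sketched as follows. Everything here is illustrative: `FakeSession`, its method names, and the `recognize` callable are hypothetical stand-ins for the query client and the recognition routine, and the `max_attempts` limit is an added assumption (the text itself simply retries).

```python
class FakeSession:
    """In-memory stand-in for the query client; it accepts the code '1234'."""

    def __init__(self):
        self.submitted = []

    def login(self):
        self.logged_in = True

    def get_home_page(self):
        return "CAPTCHA"  # the first response asks for a verification code

    def has_captcha(self, html):
        return html == "CAPTCHA"

    def captcha_image(self, html):
        return b"image-bytes"

    def submit_captcha(self, text):
        self.submitted.append(text)
        return "<html>home</html>" if text == "1234" else "CAPTCHA"


def load_home_page(session, recognize, max_attempts=5):
    """Log in, then clear any verification code before returning the HTML."""
    session.login()
    html = session.get_home_page()
    attempts = 0
    while session.has_captcha(html):
        if attempts >= max_attempts:  # assumption: give up after a few tries
            raise RuntimeError("verification code was not accepted")
        attempts += 1
        text = recognize(session.captcha_image(html))
        html = session.submit_captcha(text)  # a rejected code yields a new challenge
    return html


guesses = iter(["0000", "1234"])  # first recognition fails, second succeeds
session = FakeSession()
print(load_home_page(session, lambda image: next(guesses)))
```

Only once the loop exits does the caller go on to parse the home page, matching the two branches in the text.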
In some embodiments of the invention, step 5 specifically comprises the following steps:
Using the first crawler program, looping to obtain the URL address of each piece of company profile data content under the sub-URL address of the company profile data item, parsing the HTML web page corresponding to the URL address of each piece of company profile data content to obtain the company profile information of the first construction enterprise, and storing the company profile information in the construction enterprise information database.
It should be noted that the foregoing illustrates a specific implementation of step 5 by way of example; steps 6 to 8 can be implemented in a similar way and are not described again here.
In some embodiments of the invention, step 10 specifically comprises the following steps:
Determining whether a qualification grade data item corresponding to the enterprise name of the first construction enterprise exists on the construction market supervision and integrity publishing platform;
If no qualification grade data item of the first construction enterprise exists on the construction market supervision and integrity publishing platform, continuing to read the enterprise name of the next construction enterprise from the name text file;
If a qualification grade data item of the first construction enterprise exists on the construction market supervision and integrity publishing platform, obtaining the qualification grade content page of the first construction enterprise using the fifth crawler program, parsing the qualification grade content page, and extracting the company name, qualification category, qualification certificate number, qualification name, date of issue, certificate validity period, and issuing authority of the first construction enterprise;
Storing the company name, qualification category, qualification certificate number, qualification name, date of issue, certificate validity period, and issuing authority of the first construction enterprise, in that order, in the construction enterprise information database.
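Storing the seven fields in a fixed order can be sketched with the standard-library `sqlite3` module. The source does not name a database engine, so the choice of SQLite, the table name `qualification`, and the column names are assumptions made for illustration only.

```python
import sqlite3

# Field order follows the text: company name first, issuing authority last.
QUALIFICATION_FIELDS = (
    "company_name",
    "qualification_category",
    "certificate_number",
    "qualification_name",
    "issue_date",
    "valid_until",
    "issuing_authority",
)


def open_database(path=":memory:"):
    """Open the database and create the (hypothetical) qualification table."""
    conn = sqlite3.connect(path)
    columns = ", ".join(f"{name} TEXT" for name in QUALIFICATION_FIELDS)
    conn.execute(f"CREATE TABLE IF NOT EXISTS qualification ({columns})")
    return conn


def store_qualification(conn, record):
    """Insert one record, taking the values in the fixed field order."""
    placeholders = ", ".join("?" for _ in QUALIFICATION_FIELDS)
    values = [record[name] for name in QUALIFICATION_FIELDS]
    conn.execute(f"INSERT INTO qualification VALUES ({placeholders})", values)
    conn.commit()
```

Driving the insert from a single tuple of field names keeps the extraction order and the storage order from drifting apart.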
In some embodiments of the invention, step 11 specifically comprises the following steps:
Determining whether a registered constructor information data item corresponding to the enterprise name of the first construction enterprise exists on the construction market supervision and integrity publishing platform;
If no registered constructor information data item of the first construction enterprise exists on the construction market supervision and integrity publishing platform, continuing to read the enterprise name of the next construction enterprise from the name text file;
If a registered constructor information data item of the first construction enterprise exists on the construction market supervision and integrity publishing platform, obtaining the constructor content page of the first construction enterprise using the sixth crawler program, parsing the constructor content page, and extracting the constructor name, ID card number, registration category, registration number, and registered specialty for the first construction enterprise;
Storing the constructor name, ID card number, registration category, registration number, and registered specialty of the first construction enterprise, in that order, in the construction enterprise information database.
In some embodiments of the invention, the first crawler program, the second crawler program, the third crawler program, the fourth crawler program, the fifth crawler program, and the sixth crawler program each crawl their corresponding data item through the construction enterprise query client at the same time.
The first to sixth crawler programs are independently executed crawler tools, and all of them can run simultaneously, which improves the efficiency of crawling the data items.
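The idea of six independent crawlers running side by side can be sketched with a thread pool. This is only an analogy: the source runs six separate Scrapy programs, whereas the sketch below swaps in `concurrent.futures` threads, and `crawl_item` is a hypothetical placeholder that returns what a real crawler would store.

```python
from concurrent.futures import ThreadPoolExecutor

# The six data items named in the text (identifiers are illustrative).
DATA_ITEMS = (
    "company_profile",
    "bid_winning",
    "business",
    "honor",
    "qualification",
    "constructor",
)


def crawl_item(item, company):
    """Placeholder for one independent crawler."""
    return {"item": item, "company": company}


def crawl_all_items(company, crawl=crawl_item):
    """Start one task per data item and collect the results by item name."""
    with ThreadPoolExecutor(max_workers=len(DATA_ITEMS)) as pool:
        futures = {item: pool.submit(crawl, item, company) for item in DATA_ITEMS}
        return {item: future.result() for item, future in futures.items()}


print(sorted(crawl_all_items("Example Construction Co.")))
```

Because no task waits on another, a slow or failing data item does not hold up the remaining five.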
In some embodiments of the invention, the first crawler program, the second crawler program, the third crawler program, the fourth crawler program, the fifth crawler program, and the sixth crawler program each use a separate Scrapy framework.
For how the Scrapy framework is implemented, refer to the prior art.
To make the above solutions of the embodiments easier to understand and implement, a corresponding application scenario is described in detail below.
In the prior art, web crawlers mostly crawl the web side of a site. For websites that require a user name and password to log in, however, collecting a large amount of data through the web side easily leads to the account being blocked, which causes the big data collection to fail. For websites that require login, the embodiments of the present invention instead collect data by crawling the mobile phone client, which effectively reduces the risk of the account being blocked and helps big data collection proceed smoothly.
The embodiments adopt a breadth-first search strategy, that is, the loop runs from the top level to the bottom level: all hyperlinks in the first-level pages are searched first, and the search loop over the second-level pages starts only after the first-level pages have been traversed, and so on down to the bottom level. The embodiments use the Scrapy framework, with one independent crawler program per data item, so that crawling one data item does not affect the others and crawling speed improves to some extent.
Existing crawlers generally do not include verification code recognition. The embodiments add verification code recognition to the program so that it more closely simulates manual operation, which reduces the risk of triggering anti-crawling measures and saves the time needed to enter verification codes by hand, so the crawler runs more smoothly.
The embodiments collect only data related to construction industry companies. They provide a data collection method that captures information on all construction industry companies nationwide, with six data items captured per company: company profile, bid-winning information, business information, honor information, qualification grade, and registered constructor information. Each data item corresponds to an independent crawler program, and each crawler program uses the Scrapy framework. These six parts of information come from two websites: Jianshetong and the "Four Databases, One Platform" site. The company profile, bid-winning information, business information, and honor information come from the Jianshetong mobile phone client; the qualification grade and registered constructor information come from the "Four Databases, One Platform" site.
I. Crawling the company profile, bid-winning information, business information, and honor information.
1. Crawling company URLs.
Companies are captured in turn by province, city, county, and district. The purpose is to obtain the url and company name com_name of every company in each province, city, county, and district, and to save them in the corresponding url text file and com_name text file for use when crawling company data later.
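Saving the two lists and reading them back line by line can be sketched as follows. The function names and the file paths passed in are hypothetical; the source says only that a url text file and a com_name text file are used.

```python
from pathlib import Path


def save_companies(companies, url_path, name_path):
    """Write one URL per line and one name per line, in matching order."""
    urls = [url for url, _ in companies]
    names = [name for _, name in companies]
    Path(url_path).write_text("\n".join(urls) + "\n", encoding="utf-8")
    Path(name_path).write_text("\n".join(names) + "\n", encoding="utf-8")


def read_companies(url_path, name_path):
    """Yield (url, name) pairs by reading both files line by line."""
    with open(url_path, encoding="utf-8") as url_file, \
            open(name_path, encoding="utf-8") as name_file:
        for url, name in zip(url_file, name_file):
            yield url.strip(), name.strip()
```

Reading lazily with a generator means the later crawl steps can start on the first company without loading the full nationwide list into memory.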
2. Crawling the relevant data items of each company.
a) Data items that can be obtained without logging in, including the company profile and honor information.
The program reads the company url text file line by line to obtain the url of each company in turn. It first uses BeautifulSoup (a Python library that can extract data from HTML or XML) to parse the company home page HTML and obtain the url of each data item to be crawled (the sub-url). It then loops to obtain the url of each piece of data content under that sub-url (the content url). Finally, BeautifulSoup parses the HTML of each content page and the required information is extracted and stored in the database.
b) Data items that can be obtained only after logging in, including the bid-winning information and business information.
The login page url of the website is used as the entry point of the program. The program fills in the user name and password to simulate a manual login and, after the login succeeds, captures the required data using the method described in a). Unlike a), each time a url is obtained and before the HTML is parsed, the program checks whether a verification code has appeared. If no verification code has appeared, the HTML is parsed directly after the url is obtained. If a verification code has appeared, the verification code picture is recognized first to obtain the digits of the code, as follows: 1) the current page is parsed to obtain the url of the verification code picture, and the picture is requested and opened; 2) the picture is converted from RGB to a hue/saturation/lightness color space, and the lightness (L) component is used to turn the color picture into a grayscale image; 3) direct thresholding is applied to denoise the image, with a threshold of 90 (a value obtained through repeated experiments): pixels above the threshold are set to 1 and all others to 0; 4) the image_to_string function in pytesser (a Python module for recognizing text in pictures) converts the characters in the picture into text, which gives the verification code digits. The resulting digits are submitted to the server. If the submitted verification code is correct, the program continues to parse the HTML; if it is incorrect, the program proceeds again as in the case where a verification code has appeared. The flow chart of this part of the program is shown in Fig. 2.
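Steps 2) and 3) of the recognition procedure can be sketched in pure Python. This is an approximation under stated assumptions: the lightness component is taken from the standard-library `colorsys.rgb_to_hls`, the picture is represented as nested lists of (r, g, b) tuples rather than an image object, and the OCR call in step 4) is left out. Only the threshold value of 90 comes from the text.

```python
import colorsys

THRESHOLD = 90  # value reported in the text, on a 0-255 scale


def binarize(pixels, threshold=THRESHOLD):
    """Map rows of (r, g, b) tuples (each 0-255) to rows of 0/1 values."""
    result = []
    for row in pixels:
        out_row = []
        for r, g, b in row:
            # colorsys works on 0-1 floats and returns (hue, lightness, saturation).
            _, lightness, _ = colorsys.rgb_to_hls(r / 255, g / 255, b / 255)
            gray = lightness * 255
            out_row.append(1 if gray > threshold else 0)
        result.append(out_row)
    return result


# A tiny hypothetical 2x2 picture: black, white, a bright red, and dark grey.
sample = [[(0, 0, 0), (255, 255, 255)], [(200, 30, 30), (40, 40, 40)]]
print(binarize(sample))
```

A fixed global threshold is the simplest denoising choice; it works when the characters and the background noise are well separated in lightness, which is presumably why the value had to be tuned by experiment.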
II. Crawling the qualification grade and registered constructor information. The flow chart of this part of the program is shown in Fig. 3.
1. Crawling the qualification grade.
This data item requires six sub-data items to be crawled: qualification category, qualification certificate number, qualification name, date of issue, certificate validity period, and issuing authority.
The com_name text file is read line by line, and companies are crawled one by one by company name.
a) First determine whether the company exists on the "Four Databases, One Platform" site. If it does not, skip it and continue with the next company; if it does, perform step b).
b) Determine whether the company has any qualification grade sub-data item information. If it does not, skip it and continue with the next company; if it does, perform step c).
c) Parse the qualification grade page HTML, extract the six sub-data items, and store them in the database in the following order: company name, qualification category, qualification certificate number, qualification name, date of issue, certificate validity period, issuing authority.
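The skip logic of steps a) to c) can be sketched as a single loop. `FakePlatform` and its methods are hypothetical stand-ins for requests to the platform, and `store` is any callable that persists a company's records; none of these names come from the source.

```python
class FakePlatform:
    """In-memory stand-in: maps company name to qualification records or None."""

    def __init__(self, records):
        self.records = records

    def find_company(self, name):
        return name if name in self.records else None

    def qualification_page(self, company):
        return self.records[company]  # None means no qualification data

    def parse_qualifications(self, page):
        return list(page)


def crawl_qualifications(names, platform, store):
    """Apply steps a) to c) to each company name; return the skipped names."""
    skipped = []
    for name in names:
        company = platform.find_company(name)  # step a): is the company listed?
        if company is None:
            skipped.append(name)
            continue
        page = platform.qualification_page(company)  # step b): any data?
        if page is None:
            skipped.append(name)
            continue
        store(name, platform.parse_qualifications(page))  # step c): parse, store
    return skipped


stored = {}
platform = FakePlatform({"A": [{"qualification_name": "General contracting"}], "B": None})
print(crawl_qualifications(["A", "B", "C"], platform, stored.__setitem__))
```

Returning the skipped names is an addition for illustration; it makes it easy to see which companies were missing from the platform or had no qualification data.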
2. Crawling the registered constructor information.
This data item requires five sub-data items to be crawled: constructor name, ID card number, registration category, registration number (practice seal number), and registered specialty.
The data crawling flow is similar to that for the qualification grade; refer to the description in the preceding embodiment.
For websites that require a user name and password to log in, the present invention collects data through the app client. It uses the Scrapy crawler framework, with an independent crawler program for each data item and a breadth-first search strategy. The verification code recognition component lets the program simulate manual operation more closely, which reduces the risk of triggering anti-crawling measures and also improves crawling speed to some extent.
Referring to Fig. 4, an embodiment of the present application also provides a big data acquisition system 400 for construction enterprises, comprising:
An enterprise information acquisition module 401, configured to obtain, in turn, the Uniform Resource Locator (URL) address and corresponding enterprise name of each of a plurality of construction enterprises according to the hierarchical regional location of the construction enterprises;
A file storage module 402, configured to save the URL addresses of the plurality of construction enterprises to a URL text file, and to save the enterprise names of the plurality of construction enterprises to a name text file;
A file reading module 403, configured to read the URL address of a first construction enterprise from the URL text file, and to read the enterprise name of the first construction enterprise from the name text file, wherein the first construction enterprise is any one of the plurality of construction enterprises;
After the file reading module 403 completes, the first sub-URL acquisition module 404 and the second sub-URL acquisition module 409 are executed respectively, wherein,
The first sub-URL acquisition module 404 is configured to obtain, according to the URL address of the first construction enterprise, the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the business information data item of the first construction enterprise; after the first sub-URL acquisition module completes, the following first, second, third, and fourth crawler processing modules are executed respectively;
A first crawler processing module 405, configured to use the first crawler program to obtain the content web page corresponding to the company profile data item according to the sub-URL address of the company profile data item, to parse that content web page to obtain the company profile information of the first construction enterprise, and to store the company profile information in the construction enterprise information database; and
A second crawler processing module 406, configured to use the second crawler program to obtain the content web page corresponding to the honor information data item according to the sub-URL address of the honor information data item, to parse that content web page to obtain the honor information of the first construction enterprise, and to store the honor information in the construction enterprise information database; and
A third crawler processing module 407, configured to use the third crawler program to obtain the content web page corresponding to the bid-winning information data item according to the sub-URL address of the bid-winning information data item, to parse that content web page to obtain the bid-winning information of the first construction enterprise, and to store the bid-winning information in the construction enterprise information database; and
A fourth crawler processing module 408, configured to use the fourth crawler program to obtain the content web page corresponding to the business information data item according to the sub-URL address of the business information data item, to parse that content web page to obtain the business information of the first construction enterprise, and to store the business information in the construction enterprise information database;
The second sub-URL acquisition module 409 is configured to obtain, according to the enterprise name of the first construction enterprise, the sub-URL address of the qualification grade data item and the sub-URL address of the registered constructor information data item of the first construction enterprise from the construction market supervision and integrity publishing platform;
After the second sub-URL acquisition module 409 completes, the following fifth crawler processing module 410 and sixth crawler processing module 411 are executed respectively;
A fifth crawler processing module 410, configured to use the fifth crawler program to obtain the content web page corresponding to the qualification grade data item according to the sub-URL address of the qualification grade data item, to parse that content web page to obtain the qualification grade information of the first construction enterprise, and to store the qualification grade information in the construction enterprise information database; and
A sixth crawler processing module 411, configured to use the sixth crawler program to obtain the content web page corresponding to the registered constructor information data item according to the sub-URL address of the registered constructor information data item, to parse that content web page to obtain the registered constructor information of the first construction enterprise, and to store the registered constructor information in the construction enterprise information database.
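The execution order among modules 403 to 411 forms a small tree, which can be sketched as a dictionary plus a function that groups modules into stages. The identifiers below are invented labels built from the reference numerals; the sketch shows only the ordering, not the modules themselves.

```python
# Which modules are triggered once a given module completes (labels are illustrative).
PIPELINE = {
    "file_read_403": ["first_sub_url_404", "second_sub_url_409"],
    "first_sub_url_404": [
        "crawler_405",
        "crawler_406",
        "crawler_407",
        "crawler_408",
    ],
    "second_sub_url_409": ["crawler_410", "crawler_411"],
}


def execution_stages(root, graph=PIPELINE):
    """Group modules into stages; modules within one stage can run side by side."""
    stages = []
    current = [root]
    while current:
        stages.append(current)
        current = [child for module in current for child in graph.get(module, [])]
    return stages


for number, stage in enumerate(execution_stages("file_read_403"), start=1):
    print(number, stage)
```

The third stage contains all six crawler processing modules, which reflects the point that the six data items are crawled independently of one another.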
It should also be noted that the device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the device embodiments provided by the present invention, a connection between modules indicates that they have a communication connection, which can be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement this without creative effort.
From the description of the above embodiments, those skilled in the art will clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, and of course also by dedicated hardware, including application-specific integrated circuits, dedicated CPUs, dedicated memories, and dedicated components. In general, any function performed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function can take many forms, such as analog circuits, digital circuits, or dedicated circuits. For the present invention, however, a software implementation is in most cases the preferred embodiment. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes over the prior art, can be embodied as a software product. The computer software product is stored in a readable storage medium, such as a computer floppy disk, USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc, and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
In summary, the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments can still be modified, or some of their technical features can be replaced with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (9)

  1. A big data acquisition method for construction enterprises, characterized by comprising:
    Step 1: obtaining, in turn, the Uniform Resource Locator (URL) address and corresponding enterprise name of each of a plurality of construction enterprises according to the hierarchical regional location of the construction enterprises;
    Step 2: saving the URL addresses of the plurality of construction enterprises to a URL text file, and saving the enterprise names of the plurality of construction enterprises to a name text file;
    Step 3: reading the URL address of a first construction enterprise from the URL text file, and reading the enterprise name of the first construction enterprise from the name text file, wherein the first construction enterprise is any one of the plurality of construction enterprises;
    after step 3 is completed, performing step 4 and step 9 respectively;
    Step 4: obtaining, according to the URL address of the first construction enterprise, the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the business information data item of the first construction enterprise; after step 4 is completed, performing the following steps 5, 6, 7, and 8 respectively;
    Step 5: using a first crawler program to obtain the content web page corresponding to the company profile data item according to the sub-URL address of the company profile data item, parsing the content web page corresponding to the company profile data item to obtain the company profile information of the first construction enterprise, and storing the company profile information in a construction enterprise information database; and
    Step 6: using a second crawler program to obtain the content web page corresponding to the honor information data item according to the sub-URL address of the honor information data item, parsing the content web page corresponding to the honor information data item to obtain the honor information of the first construction enterprise, and storing the honor information in the construction enterprise information database; and
    Step 7: using a third crawler program to obtain the content web page corresponding to the bid-winning information data item according to the sub-URL address of the bid-winning information data item, parsing the content web page corresponding to the bid-winning information data item to obtain the bid-winning information of the first construction enterprise, and storing the bid-winning information in the construction enterprise information database; and
    Step 8: using a fourth crawler program to obtain the content web page corresponding to the business information data item according to the sub-URL address of the business information data item, parsing the content web page corresponding to the business information data item to obtain the business information of the first construction enterprise, and storing the business information in the construction enterprise information database;
    Step 9: obtaining, according to the enterprise name of the first construction enterprise, the sub-URL address of the qualification grade data item and the sub-URL address of the registered constructor information data item of the first construction enterprise from the construction market supervision and integrity publishing platform;
    after step 9 is completed, performing the following steps 10 and 11 respectively;
    Step 10: using a fifth crawler program to obtain the content web page corresponding to the qualification grade data item according to the sub-URL address of the qualification grade data item, parsing the content web page corresponding to the qualification grade data item to obtain the qualification grade information of the first construction enterprise, and storing the qualification grade information in the construction enterprise information database; and
    Step 11: using a sixth crawler program to obtain the content web page corresponding to the registered constructor information data item according to the sub-URL address of the registered constructor information data item, parsing the content web page corresponding to the registered constructor information data item to obtain the registered constructor information of the first construction enterprise, and storing the registered constructor information in the construction enterprise information database.
  2. The big data acquisition method for construction enterprises according to claim 1, characterized in that step 4 specifically comprises the following steps:
    obtaining the home page HyperText Markup Language (HTML) web page of the first construction enterprise according to the URL address of the first construction enterprise;
    parsing the home page HTML web page and searching all hyperlinks on the home page HTML web page; after the traversal of the first-level pages is complete, starting the search of the second-level pages, and continuing in this loop until the bottom-level pages have been searched; after the search is complete, outputting the sub-URL address of the company profile data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the business information data item of the first construction enterprise.
  3. The big data acquisition method for construction enterprises according to claim 2, characterized in that obtaining the home page HTML web page of the first construction enterprise according to the URL address of the first construction enterprise comprises the following steps:
    when the URL address of the first construction enterprise is a login page URL address, entering the user name and password through the construction enterprise query client;
    after the login succeeds, determining whether a verification code appears on the home page HTML web page of the first construction enterprise;
    if no verification code appears on the home page HTML web page, triggering the step of parsing the home page HTML web page; or
    if a verification code appears on the home page HTML web page, recognizing the verification code picture on the home page HTML web page to obtain the picture information on the verification code picture; submitting the picture information to the server for verification; when the verification code is accepted, triggering the step of parsing the home page HTML web page; when the verification code is not accepted, recognizing the verification code picture on the home page HTML web page again.
  4. The big data acquisition method for construction enterprises according to claim 1, characterized in that step 5 specifically comprises the following steps:
    using the first crawler program, looping to obtain the URL address of each piece of company profile data content under the sub-URL address of the company profile data item, parsing the HTML web page corresponding to the URL address of each piece of company profile data content to obtain the company profile information of the first construction enterprise, and storing the company profile information in the construction enterprise information database.
  5. The big data acquisition method for building construction enterprises according to claim 1, characterized in that step 10 specifically comprises the following steps:
    Determine whether a qualification grade data item corresponding to the enterprise name of the first building construction enterprise exists on the construction market supervision and integrity publishing platform;
    If no qualification grade data item for the first building construction enterprise exists on the construction market supervision and integrity publishing platform, continue by reading the enterprise name of the next building construction enterprise from the name text file;
    If a qualification grade data item for the first building construction enterprise exists on the construction market supervision and integrity publishing platform, obtain the qualification grade content page of the first building construction enterprise using the fifth crawler program, parse the qualification grade content page, and extract the company name, qualification category, qualification certificate number, qualification name, date of issue, certificate validity period, and issuing authority of the first building construction enterprise;
    Store the company name, qualification category, qualification certificate number, qualification name, date of issue, certificate validity period, and issuing authority of the first building construction enterprise, in that order, in the building construction enterprise information database.
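The branch in this claim (skip to the next enterprise name when the platform has no record, otherwise extract the seven fields and store them in order) might look like the sketch below. The `lookup` callable, the English column names, and the `qualification` table are assumptions made for illustration; the claim only fixes the list and order of the fields.

```python
import sqlite3
from typing import Callable, Iterable, Optional

# Field order follows the claim; the English column names are assumptions.
QUALIFICATION_FIELDS = (
    "company_name", "qualification_category", "certificate_number",
    "qualification_name", "issue_date", "valid_until", "issuing_authority",
)


def collect_qualifications(
    names: Iterable[str],
    lookup: Callable[[str], Optional[dict]],
    db: sqlite3.Connection,
) -> list:
    """Store qualification records; return the names the platform has no record for."""
    columns = ", ".join(f"{field} TEXT" for field in QUALIFICATION_FIELDS)
    db.execute(f"CREATE TABLE IF NOT EXISTS qualification ({columns})")
    placeholders = ", ".join("?" for _ in QUALIFICATION_FIELDS)
    missing = []
    for name in names:
        record = lookup(name)
        if record is None:        # no data item: move on to the next name
            missing.append(name)
            continue
        # Build the row in the fixed field order before inserting it.
        row = tuple(record.get(field, "") for field in QUALIFICATION_FIELDS)
        db.execute(f"INSERT INTO qualification VALUES ({placeholders})", row)
    db.commit()
    return missing
```

Returning the list of names without a record is an addition for the example; it makes skipped enterprises visible instead of dropping them silently.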
  6. The big data acquisition method for building construction enterprises according to claim 1, characterized in that step 11 specifically comprises the following steps:
    Determine whether a registered constructor information data item corresponding to the enterprise name of the first building construction enterprise exists on the construction market supervision and integrity publishing platform;
    If no registered constructor information data item for the first building construction enterprise exists on the construction market supervision and integrity publishing platform, continue by reading the enterprise name of the next building construction enterprise from the name text file;
    If a registered constructor information data item for the first building construction enterprise exists on the construction market supervision and integrity publishing platform, obtain the registered constructor content page of the first building construction enterprise using the sixth crawler program, parse the registered constructor content page, and extract the constructor name, ID card number, registration category, registration number, and registered specialty for the first building construction enterprise;
    Store the constructor name, ID card number, registration category, registration number, and registered specialty for the first building construction enterprise, in that order, in the building construction enterprise information database.
  7. The big data acquisition method for building construction enterprises according to claim 1, characterized in that the first crawler program, the second crawler program, the third crawler program, the fourth crawler program, the fifth crawler program, and the sixth crawler program simultaneously crawl their respective data items to be crawled through the building construction enterprise query client.
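Running the six crawler programs at the same time can be illustrated with a thread pool. This sketch swaps in Python's `concurrent.futures` for whatever scheduling the patent actually uses, and the `crawlers` mapping of data item names to zero-argument callables is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


def run_crawlers(crawlers: Dict[str, Callable[[], object]]) -> Dict[str, object]:
    """Run every crawler in its own worker thread and gather the results."""
    with ThreadPoolExecutor(max_workers=max(1, len(crawlers))) as pool:
        futures = {item: pool.submit(job) for item, job in crawlers.items()}
        # result() blocks until that crawler finishes and re-raises its errors
        return {item: future.result() for item, future in futures.items()}
```

Because each data item has its own crawler, no crawler has to wait for another to finish, which is the source of the speed-up the method relies on.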
  8. The big data acquisition method for building construction enterprises according to claim 1, characterized in that the first crawler program, the second crawler program, the third crawler program, the fourth crawler program, the fifth crawler program, and the sixth crawler program each use a mutually independent Scrapy framework.
  9. A big data acquisition system for building construction enterprises, characterized by comprising:
    An enterprise information acquisition module, configured to obtain in turn, according to the hierarchical regional relationship of the locations of building construction enterprises, the respective uniform resource locator (URL) addresses and corresponding enterprise names of a plurality of building construction enterprises;
    A file storage module, configured to save the respective URL addresses of the plurality of building construction enterprises in a URL text file and to save the respective enterprise names of the plurality of building construction enterprises in a name text file;
    A file reading module, configured to read the URL address of a first building construction enterprise from the URL text file and to read the enterprise name of the first building construction enterprise from the name text file, wherein the first building construction enterprise is any one of the plurality of building construction enterprises;
    After the file reading module has finished executing, a first sub-URL acquisition module and a second sub-URL acquisition module are executed respectively, wherein,
    The first sub-URL acquisition module is configured to obtain, from the URL address of the first building construction enterprise, the sub-URL address of the company introduction data item, the sub-URL address of the honor information data item, the sub-URL address of the bid-winning information data item, and the sub-URL address of the operation information data item of the first building construction enterprise; after the first sub-URL acquisition module has finished executing, the following first crawler processing module, second crawler processing module, third crawler processing module, and fourth crawler processing module are executed respectively;
    The first crawler processing module is configured to use a first crawler program to obtain the content page corresponding to the company introduction data item from the sub-URL address of the company introduction data item, to parse that content page to obtain the company introduction information of the first building construction enterprise, and to store the company introduction information in the building construction enterprise information database; and
    The second crawler processing module is configured to use a second crawler program to obtain the content page corresponding to the honor information data item from the sub-URL address of the honor information data item, to parse that content page to obtain the honor information of the first building construction enterprise, and to store the honor information in the building construction enterprise information database; and
    The third crawler processing module is configured to use a third crawler program to obtain the content page corresponding to the bid-winning information data item from the sub-URL address of the bid-winning information data item, to parse that content page to obtain the bid-winning information of the first building construction enterprise, and to store the bid-winning information in the building construction enterprise information database; and
    The fourth crawler processing module is configured to use a fourth crawler program to obtain the content page corresponding to the operation information data item from the sub-URL address of the operation information data item, to parse that content page to obtain the operation information of the first building construction enterprise, and to store the operation information in the building construction enterprise information database;
    The second sub-URL acquisition module is configured to obtain, according to the enterprise name of the first building construction enterprise, the sub-URL address of the qualification grade data item and the sub-URL address of the registered constructor information data item of the first building construction enterprise from the construction market supervision and integrity publishing platform;
    After the second sub-URL acquisition module has finished executing, the following fifth crawler processing module and sixth crawler processing module are executed respectively;
    The fifth crawler processing module is configured to use a fifth crawler program to obtain the content page corresponding to the qualification grade data item from the sub-URL address of the qualification grade data item, to parse that content page to obtain the qualification grade information of the first building construction enterprise, and to store the qualification grade information in the building construction enterprise information database; and
    The sixth crawler processing module is configured to use a sixth crawler program to obtain the content page corresponding to the registered constructor information data item from the sub-URL address of the registered constructor information data item, to parse that content page to obtain the registered constructor information of the first building construction enterprise, and to store the registered constructor information in the building construction enterprise information database.
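The module flow of the system claim (two sub-URL acquisition modules feeding six crawler processing modules) can be summarized as a small dispatch plan. In this sketch the two `sub_urls_from_*` callables and the English data item keys are hypothetical; only the pairing of crawler numbers with data items is taken from the claim.

```python
from typing import Callable, Dict, List, Tuple

# Crawlers 1-4 work from the enterprise's own site, crawlers 5-6 from the platform.
FIRST_GROUP = ("company_introduction", "honor", "bid_winning", "operation")
SECOND_GROUP = ("qualification", "constructor")


def plan_tasks(
    enterprise_url: str,
    enterprise_name: str,
    sub_urls_from_site: Callable[[str], Dict[str, str]],
    sub_urls_from_platform: Callable[[str], Dict[str, str]],
) -> List[Tuple[int, str, str]]:
    """Return (crawler number, data item, sub-URL) tasks for one enterprise."""
    site = sub_urls_from_site(enterprise_url)            # first sub-URL module
    platform = sub_urls_from_platform(enterprise_name)   # second sub-URL module
    tasks = []
    for number, item in enumerate(FIRST_GROUP + SECOND_GROUP, start=1):
        source = site if item in FIRST_GROUP else platform
        if item in source:                               # skip missing data items
            tasks.append((number, item, source[item]))
    return tasks
```

Each returned task can then be handed to the crawler with the matching number, so the six crawlers stay independent of one another.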
CN201710809082.8A 2017-09-09 2017-09-09 Big data acquisition method and system for building construction enterprises Active CN107590236B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710809082.8A CN107590236B (en) 2017-09-09 2017-09-09 Big data acquisition method and system for building construction enterprises

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710809082.8A CN107590236B (en) 2017-09-09 2017-09-09 Big data acquisition method and system for building construction enterprises

Publications (2)

Publication Number Publication Date
CN107590236A (en) 2018-01-16
CN107590236B (en) 2020-08-28

Family

ID=61051124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710809082.8A Active CN107590236B (en) 2017-09-09 2017-09-09 Big data acquisition method and system for building construction enterprises

Country Status (1)

Country Link
CN (1) CN107590236B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140032540A1 (en) * 1999-07-20 2014-01-30 Google Inc. Internet based system and apparatus for paying users to view content and receiving micropayments
CN102937989A (en) * 2012-10-29 2013-02-20 北京腾逸科技发展有限公司 Parallel distributed internet data capture method and system
CN103226568A (en) * 2013-03-12 2013-07-31 北京百度网讯科技有限公司 Method and equipment for crawling page
CN103605764A (en) * 2013-11-26 2014-02-26 Tcl集团股份有限公司 Web crawler system and web crawler multitask executing and scheduling method
CN105389310A (en) * 2014-09-03 2016-03-09 上海尧博信息科技有限公司 Method of applying web crawlers to household registration management
CN104376063A (en) * 2014-11-11 2015-02-25 南京邮电大学 Multithreading web crawler method based on sort management and real-time information updating system
CN105279272A (en) * 2015-10-30 2016-01-27 南京未来网络产业创新有限公司 Content aggregation method based on distributed web crawlers

Non-Patent Citations (1)

Title
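You Jianxin et al.: "Website knowledge acquisition and application based on Web data mining: a case study of Dianping.com", Journal of Shanghai University (Natural Science Edition) *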

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN110020092A (en) * 2018-11-20 2019-07-16 皮商云集(厦门)科技有限公司 Leather industry data center systems based on web crawlers
CN110134893A (en) * 2019-04-03 2019-08-16 广州朗国电子科技有限公司 A kind of multimachine structure retrieval display method and device based on cloud information issuing system
CN110134893B (en) * 2019-04-03 2022-05-31 广州朗国电子科技股份有限公司 Multi-mechanism retrieval display method and device based on cloud information publishing system
CN110728418A (en) * 2019-08-26 2020-01-24 成都市互联互通大数据科技有限公司 Method for counting waste standard rate
CN110502680A (en) * 2019-08-27 2019-11-26 重庆大司空信息科技有限公司 A kind of abstracting method and device of acceptance of the bid bulletin relevant field
CN112801820A (en) * 2021-02-05 2021-05-14 郝大伟 Big data acquisition method for building construction enterprises

Also Published As

Publication number Publication date
CN107590236B (en) 2020-08-28

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information
    Address after: No.01, 9 / F, building 2, Guigu building, ningwei street, Xiaoshan District, Hangzhou City, Zhejiang Province
    Applicant after: Digital cube (Hangzhou) Information Technology Co.,Ltd.
    Address before: Yuhang District, Hangzhou City, Zhejiang Province, 311100 West Sea Park No. 998 Building No. 4 802
    Applicant before: HANGZHOU DATA CUBE CREDIT REFERENCE Co.,Ltd.
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right
    Denomination of invention: A big data acquisition method and system for construction enterprises
    Effective date of registration: 20230713
    Granted publication date: 20200828
    Pledgee: Hangzhou High-tech Financing Guarantee Co.,Ltd.
    Pledgor: Digital cube (Hangzhou) Information Technology Co.,Ltd.
    Registration number: Y2023330001481