CN110516135A - A kind of crawler system and method - Google Patents

Info

Publication number
CN110516135A
Authority
CN
China
Prior art keywords
crawler
backstage
webpage
crawl
risk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910807818.7A
Other languages
Chinese (zh)
Inventor
黄逸群
郑航星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Shiqu Information and Technology Co Ltd
Original Assignee
Hangzhou Shiqu Information and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/951Indexing; Web crawling techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a crawler system and method. The crawler system includes: a crawler backend, configured to crawl web pages from the network according to a preset crawl target; a crawler report real-time processing system, configured to classify and parse the web pages crawled by the crawler backend to obtain a classification parsing result, and to store the classification parsing result and the crawled web pages in a database; and a management backend, configured to manage the classification parsing result and the crawled web pages stored in the database. In this way, the practicability of the crawler system can be improved.

Description

A crawler system and method
Technical field
This application relates to the technical field of information processing, and in particular to a crawler system and method.
Background
A web crawler system is a system that automatically crawls web pages from the network. The crawled web pages are typically provided to a third party (e.g., a search engine) for use.
However, the function of current web crawler systems is limited to crawling web pages; they serve a single function and have low practicability.
Summary of the invention
To solve the above technical problem, embodiments of this application provide a crawler system and method, so as to improve the practicability of the crawler system. The technical solution is as follows:
A crawler system, comprising:
a crawler backend, configured to crawl web pages from the network according to a preset crawl target;
a crawler report real-time processing system, configured to classify and parse the web pages crawled by the crawler backend to obtain a classification parsing result, and to store the classification parsing result and the crawled web pages in a database;
a management backend, configured to manage the classification parsing result and the crawled web pages stored in the database.
Preferably, the crawler backend is specifically configured to:
select, according to a web page analysis algorithm, website link URLs relevant to the preset crawl target, and store the URLs relevant to the preset crawl target in a to-be-crawled URL queue;
choose a URL from the to-be-crawled URL queue according to a search strategy as the target URL, and crawl a web page from the network according to the target URL;
determine whether a set crawl condition has been reached;
if it has been reached, end the crawl;
if it has not been reached, perform again the step of selecting, according to the web page analysis algorithm, website link URLs relevant to the preset crawl target, until the set crawl condition is reached.
Preferably, the crawler system further includes:
a distributed message middleware, configured to pass the web pages crawled by the crawler backend to the crawler report real-time processing system.
Preferably, the management backend includes:
a monitoring alarm subsystem, configured to monitor whether the updates of the web pages crawled by the crawler backend and stored in the database are abnormal;
and, if an abnormality exists, to raise an abnormality alarm and notify the alarm recipients.
Preferably, the management backend further includes:
a crawler channel and report management subsystem, configured to periodically retrieve the content of specified columns of specified websites from the web pages crawled by the crawler backend and stored in the database, to obtain a retrieval result, and to feed back to an administrator, according to the retrieval result, the websites whose column content has changed.
Preferably, the management backend further includes:
a policies and regulations management subsystem, configured to edit, according to a unified format, the policies and regulations selected from the web pages crawled by the crawler backend and stored in the database, to obtain editable text; to upload the editable text to the database; and to provide a policies-and-regulations query interface in which the editable text can be searched and displayed.
Preferably, the management backend further includes:
a risk report management subsystem, configured to enter risk reports and to provide a risk report search and display interface, so as to receive a risk report search request in that interface, respond to the request by querying the database for risk reports, and display them, where a risk report is a report that assesses business risk on the basis of the policies and regulations.
Preferably, the management backend further includes:
a risk entry management subsystem, configured to enter risk entries and to provide a risk entry search and display interface, so as to receive a risk entry search request in that interface, respond to the request by searching the database for risk entries, and display them, where a risk entry is an entry carrying business risk that is extracted, on the basis of the policies and regulations, from the web pages crawled by the crawler backend.
Preferably, the management backend further includes:
a crawler result report display subsystem, configured to display the classification parsing result and the web pages crawled by the crawler backend that are stored in the database.
A crawler method, comprising:
crawling web pages from the network according to a preset crawl target;
classifying and parsing the crawled web pages to obtain a classification parsing result, and storing the classification parsing result and the crawled web pages in a database;
managing the classification parsing result and the crawled web pages stored in the database.
Compared with the prior art, this application has the following beneficial effects:
This application provides a crawler system that includes a crawler backend, a crawler report real-time processing system, and a management backend. The crawler backend crawls web pages from the network according to a preset crawl target, which implements the web page crawling function; the crawler report real-time processing system implements an information collection function; and the management backend implements an information management function. Compared with a traditional crawler system, which can only crawl web pages, the crawler system of this application has extended functions, so its practicability is improved.
Brief description of the drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Evidently, the drawings described below are only some embodiments of this application; a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the logical structure of a crawler system provided by this application;
Fig. 2 is a schematic diagram of the logical structure of another crawler system provided by this application;
Fig. 3 is a schematic diagram of the logical structure of another crawler system provided by this application;
Fig. 4 is a schematic diagram of the logical structure of another crawler system provided by this application;
Fig. 5 is a schematic diagram of an interface for entering valid information provided by this application;
Fig. 6 is a schematic diagram of the logical structure of another crawler system provided by this application;
Fig. 7 is a schematic diagram of a policies-and-regulations query interface provided by this application;
Fig. 8 is a schematic diagram of a create or modify page provided by this application;
Fig. 9 is a schematic diagram of the logical structure of another crawler system provided by this application;
Fig. 10 is a schematic diagram of a risk report search and display interface provided by this application;
Fig. 11 is a schematic diagram of another create or modify page provided by this application;
Fig. 12 is a schematic diagram of the logical structure of another crawler system provided by this application;
Fig. 13 is a schematic diagram of the initial interface for entering risk entries provided by this application;
Fig. 14 is a schematic diagram of a content maintenance interface provided by this application;
Fig. 15 is a schematic diagram of a case editing interface provided by this application;
Fig. 16 is a schematic diagram of a risk entry search and display interface provided by this application;
Fig. 17 is a schematic diagram of a display interface for the content details of a risk entry provided by this application;
Fig. 18 is a schematic diagram of a display interface for case content provided by this application;
Fig. 19 is a schematic diagram of the logical structure of another crawler system provided by this application;
Fig. 20 is a schematic diagram of a web page display interface provided by this application;
Fig. 21 is a schematic diagram of a display interface for details provided by this application.
Detailed description of the embodiments
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
An embodiment of this application discloses a crawler system, including: a crawler backend, configured to crawl web pages from the network according to a preset crawl target; a crawler report real-time processing system, configured to classify and parse the web pages crawled by the crawler backend to obtain a classification parsing result, and to store the classification parsing result and the crawled web pages in a database; and a management backend, configured to manage the classification parsing result and the crawled web pages stored in the database. In this application, the functions of the crawler system are extended, which improves its practicability.
The crawler system disclosed in the embodiments of this application is described next. As shown in Fig. 1, the crawler system includes: a crawler backend 11, a crawler report real-time processing system 12, and a management backend 13.
The crawler backend 11 is configured to crawl web pages from the network according to a preset crawl target.
Preferably, the process by which the crawler backend 11 crawls web pages from the network according to the preset crawl target may include:
A11. Select, according to a web page analysis algorithm, website link URLs relevant to the preset crawl target, and store the URLs relevant to the preset crawl target in a to-be-crawled URL queue;
A12. Choose a URL from the to-be-crawled URL queue according to a search strategy as the target URL, and crawl a web page from the network according to the target URL;
A13. Determine whether the set crawl condition has been reached;
If it has been reached, perform step A14; if it has not been reached, perform step A15.
A14. End the crawl.
A15. Perform again the step of selecting, according to the web page analysis algorithm, website link URLs relevant to the preset crawl target, until the set crawl condition is reached.
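Steps A11 to A15 can be sketched as a small loop. This is only an illustration under stated assumptions: the relevance test `is_relevant` stands in for the unspecified web page analysis algorithm, a FIFO queue (breadth-first order) stands in for the unspecified search strategy, and a page budget `max_pages` stands in for the set crawl condition. The `fetch` and `extract_links` callables are hypothetical and are supplied by the caller.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], str],
          extract_links: Callable[[str], List[str]],
          is_relevant: Callable[[str], bool],
          max_pages: int) -> Dict[str, str]:
    """Crawl relevant pages until the queue is empty or max_pages is reached."""
    # A11: keep only URLs relevant to the crawl target and queue them.
    queue = deque(u for u in seed_urls if is_relevant(u))
    seen = set(queue)
    pages: Dict[str, str] = {}
    # A13: the assumed crawl condition is a page budget.
    while queue and len(pages) < max_pages:
        # A12: FIFO order is the assumed search strategy.
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        # A15: go back to the selection step for links found on this page.
        for link in extract_links(html):
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    # A14: the crawl ends.
    return pages
```

Passing the fetcher and link extractor as parameters keeps the loop testable without network access; a different search strategy (for example, best-first by relevance score) would only change how the next URL is taken from the queue.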
Of course, the process by which the crawler backend 11 crawls web pages from the network according to the preset crawl target may alternatively include:
Starting from the URLs of one or several initial pages, the crawler obtains the URLs on the initial pages and, while crawling, continually extracts new URLs from the current page and puts them into the queue, until a stop condition of the system is met.
In this embodiment, after crawling a web page, the crawler backend 11 edits the crawled page according to a unified format and sends the edited page content to the crawler report real-time processing system 12.
For the unified format, see Table 1.
Table 1
The data tables that the crawler backend 11 needs to maintain may be as follows:
CREATE TABLE `SpiderSite` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `platform` varchar(100) NOT NULL DEFAULT '' COMMENT 'Crawled site type (PC: PC site, WeChatPublicNumber: WeChat official account)',
  `siteName` varchar(500) NOT NULL DEFAULT '' COMMENT 'Crawled site name',
  `domain` varchar(500) NOT NULL DEFAULT '' COMMENT 'Crawled site domain name',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'Operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'Deleted flag (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Crawler channel management table';
CREATE TABLE `SpiderReport` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `spiderId` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'Crawler id',
  `msgType` tinyint(4) unsigned NOT NULL DEFAULT '0' COMMENT 'Message type (1: updated, 2: no update, 3: abnormal)',
  `platform` varchar(100) NOT NULL DEFAULT '' COMMENT 'Crawled site type (PC: PC site, WeChatPublicNumber: WeChat official account)',
  `siteName` varchar(500) NOT NULL DEFAULT '' COMMENT 'Crawled site name',
  `domain` varchar(500) NOT NULL DEFAULT '' COMMENT 'Crawled site domain name',
  `statisticsTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Crawl time',
  `subjectTitle` varchar(1000) NOT NULL DEFAULT '' COMMENT 'Crawled article title',
  `subjectUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'Crawled article link',
  `subjectDate` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Crawled article publication date',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'Deleted flag (0: normal, 1: deleted)',
  PRIMARY KEY (`id`),
  KEY `idx_msgType` (`msgType`),
  KEY `idx_platform` (`platform`),
  KEY `idx_domain` (`domain`),
  KEY `idx_subjectDate` (`subjectDate`),
  KEY `idx_created` (`created`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Crawler result data table';
The crawler report real-time processing system 12 is configured to classify and parse the web pages crawled by the crawler backend 11 to obtain a classification parsing result, and to store the classification parsing result and the crawled web pages in the database.
Classifying and parsing the web pages crawled by the crawler backend 11 can be understood as classifying and parsing the update attribute of those pages. The classification parsing result may be: updated, not updated, or update abnormal.
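The three-way classification can be sketched as follows. The source does not say how an update is detected, so this sketch assumes a content-digest comparison against the previous crawl and treats a missing or empty page as abnormal. The numeric codes follow the `msgType` column of the `SpiderReport` table as best it can be read (1: updated, 2: no update, 3: abnormal); the function name is hypothetical.

```python
import hashlib
from typing import Optional

# Assumed msgType codes, read from the SpiderReport column comment.
HAS_UPDATE, NO_UPDATE, ABNORMAL = 1, 2, 3


def classify_update(previous_digest: Optional[str],
                    fetched_html: Optional[str]) -> int:
    """Classify the update attribute of one crawled page."""
    if not fetched_html:
        # Assumption: a failed fetch or an empty page counts as abnormal.
        return ABNORMAL
    digest = hashlib.sha256(fetched_html.encode("utf-8")).hexdigest()
    # A page never seen before (previous_digest is None) counts as updated.
    return NO_UPDATE if digest == previous_digest else HAS_UPDATE
```

A digest comparison flags any byte-level change, including rotating ads or timestamps, so a real deployment would likely hash only the extracted article list rather than the whole page.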
The management backend 13 is configured to manage the classification parsing result and the web pages crawled by the crawler backend 11 that are stored in the database.
As another optional embodiment of this application, Fig. 2 is a schematic structural diagram of Embodiment 2 of a crawler system provided by this application. This embodiment mainly extends the crawler system described in Embodiment 1 above. As shown in Fig. 2, on the basis of the crawler system shown in Fig. 1, the system may further include:
a distributed message middleware 14, configured to pass the web pages crawled by the crawler backend 11 to the crawler report real-time processing system 12.
Using the distributed message middleware 14 to pass the web pages crawled by the crawler backend 11 to the crawler report real-time processing system 12 improves the efficiency of message delivery.
The distributed message middleware 14 is a distributed, multi-platform, multi-scenario, highly reliable message queue developed on the basis of Kafka/RocketMQ. It is developed in PHP and supports three platforms: PHP, C++, and Java.
Compared with earlier distributed messaging systems, the distributed message middleware 14 provided in this embodiment has the following characteristics:
1. Multi-platform support: PHP, C++, and Java are supported.
2. The distributed message middleware 14 stores every message it receives, which means that a consumer can consume a batch of messages repeatedly; this can be done through a specific API or operation.
3. Multi-scenario support: it can be used across many scenarios, which saves technical development cost.
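The second characteristic, that stored messages can be consumed repeatedly, can be illustrated with a toy in-memory log. This is not the Kafka or RocketMQ client API, nor the interface of the middleware described here; the class and method names are hypothetical and only show the idea of retained messages plus a per-consumer offset that can be rewound.

```python
from collections import defaultdict
from typing import Dict, List


class ReplayableLog:
    """Append-only message log with one read offset per consumer."""

    def __init__(self) -> None:
        self._messages: List[str] = []
        self._offsets: Dict[str, int] = defaultdict(int)

    def publish(self, message: str) -> int:
        """Store a message and return its offset; nothing is deleted on read."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer: str) -> List[str]:
        """Return all messages this consumer has not read yet."""
        start = self._offsets[consumer]
        batch = self._messages[start:]
        self._offsets[consumer] = len(self._messages)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Rewind (or advance) a consumer so a batch can be consumed again."""
        self._offsets[consumer] = max(0, min(offset, len(self._messages)))
```

Because reading only moves an offset, the report processing side could reprocess a batch of crawled pages after a failure by calling `seek` and polling again.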
In another embodiment of this application, the management backend 13 is described. Referring to Fig. 3, the management backend 13 may include a monitoring alarm subsystem 131.
The monitoring alarm subsystem 131 is configured to monitor whether the updates of the web pages crawled by the crawler backend 11 and stored in the database are abnormal;
if an abnormality exists, it raises an abnormality alarm and notifies the alarm recipients.
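A minimal sketch of the monitoring step, assuming that abnormality is recorded as message type 3 in the stored crawl results and that notification is a caller-supplied callback. The record type, the message text, and the `notify` signature are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

ABNORMAL = 3  # assumed msgType code for an abnormal update


@dataclass
class CrawlRecord:
    site_name: str
    msg_type: int  # 1: updated, 2: no update, 3: abnormal (assumed codes)


def raise_alarms(records: Iterable[CrawlRecord],
                 recipients: List[str],
                 notify: Callable[[str, str], None]) -> int:
    """Notify every recipient about each abnormal record; return the alarm count."""
    alarms = 0
    for record in records:
        if record.msg_type != ABNORMAL:
            continue
        for recipient in recipients:
            notify(recipient, f"Update abnormal: {record.site_name}")
        alarms += 1
    return alarms
```

In practice `notify` might send an email or a chat message; keeping it as a parameter leaves that choice open.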
In another embodiment of this application, another management backend 13 is provided. Referring to Fig. 4, on the basis of the management backend 13 shown in Fig. 3, it may further include a crawler channel and report management subsystem 132.
The crawler channel and report management subsystem 132 is configured to periodically retrieve the content of specified columns of specified websites from the web pages crawled by the crawler backend 11 and stored in the database, to obtain a retrieval result, and to feed back to the administrator, according to the retrieval result, the websites whose column content has changed.
After the websites whose column content has changed are fed back to the administrator, the administrator can screen them manually and enter the valid information into the database.
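One way to decide which websites have changed column content is to compare the article titles retrieved in the current run with those from the previous run. The sketch below assumes each column is represented as a set of titles keyed by site and reports only newly appeared titles; the source does not specify the comparison, and the function name is hypothetical.

```python
from typing import Dict, Set


def changed_sites(current: Dict[str, Set[str]],
                  previous: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Map each site with new column items to the set of those new items."""
    changes: Dict[str, Set[str]] = {}
    for site, titles in current.items():
        # Titles present now but absent in the previous retrieval.
        new_titles = titles - previous.get(site, set())
        if new_titles:
            changes[site] = new_titles
    return changes
```

The returned mapping is what would be shown to the administrator for manual screening; removed titles are ignored here on the assumption that only newly published items matter.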
For the interface in which the administrator enters valid information, see Fig. 5. As shown in Fig. 5, the input elements may include a channel URL and a channel name.
Clicking Add saves the channel URL and channel name to the list.
Clicking Query performs a fuzzy match in the database against the channel URL and channel name and displays the results in the channel list. In its initial state, the channel list shows all URLs by default.
Each item in the channel list can be deleted or modified.
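The fuzzy query on channel URL and channel name can be sketched with SQL `LIKE` patterns. This example uses Python's built-in `sqlite3` as a stand-in for the MySQL database, a reduced `SpiderSite` table, and the assumption that the channel URL maps to the `domain` column and the channel name to `siteName`. The helper names are hypothetical.

```python
import sqlite3
from typing import List, Tuple


def search_channels(conn: sqlite3.Connection,
                    url_part: str = "",
                    name_part: str = "") -> List[Tuple[str, str]]:
    """Fuzzy-match channel URL and name; empty filters match every row."""
    sql = (
        "SELECT domain, siteName FROM SpiderSite "
        "WHERE isDeleted = 0 AND domain LIKE ? AND siteName LIKE ? "
        "ORDER BY id"
    )
    # Values are bound as parameters, so user input cannot alter the SQL.
    return conn.execute(sql, (f"%{url_part}%", f"%{name_part}%")).fetchall()


def make_demo_db() -> sqlite3.Connection:
    """Build a hypothetical in-memory stand-in for the SpiderSite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SpiderSite (id INTEGER PRIMARY KEY, siteName TEXT, "
        "domain TEXT, isDeleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO SpiderSite (siteName, domain) VALUES (?, ?)",
        [("Tax Bureau", "tax.example.com"),
         ("Market Regulator", "market.example.com")],
    )
    conn.commit()
    return conn
```

With both filters empty the pattern is `%%`, which matches every row and reproduces the default state in which the channel list shows all URLs. A `%` or `_` typed by the user still acts as a wildcard in this sketch.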
In another embodiment of this application, another management backend 13 is provided. Referring to Fig. 6, on the basis of the management backend 13 shown in Fig. 4, it may further include:
a policies and regulations management subsystem 133, configured to edit, according to a unified format, the policies and regulations selected from the web pages crawled by the crawler backend 11 and stored in the database, to obtain editable text; to upload the editable text to the database; and to provide a policies-and-regulations query interface in which the editable text can be searched and displayed.
For the policies-and-regulations query interface, see Fig. 7. As shown in Fig. 7, the interface includes the following:
Query conditions may include: file name, issuing unit, file category, region level, and whether the document is in force.
After any of the query conditions are filled in, clicking the "Query" button shows the query result; the file name is matched fuzzily.
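The query behaviour can be sketched in plain Python: every condition is optional, the file name is matched as a case-insensitive substring, and the result follows the list's ordering rule of category first, then entry time. The field names, the ascending sort direction, and the subset of conditions shown are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class PolicyEntry:
    file_name: str
    file_category: str
    issuing_unit: str
    entered_at: int  # entry time as a Unix timestamp


def query_policies(entries: Iterable[PolicyEntry],
                   name_part: str = "",
                   file_category: Optional[str] = None,
                   issuing_unit: Optional[str] = None) -> List[PolicyEntry]:
    """Filter by whichever conditions are given, then sort for display."""
    hits = [
        e for e in entries
        # Fuzzy match: an empty name_part matches every file name.
        if name_part.lower() in e.file_name.lower()
        and (file_category is None or e.file_category == file_category)
        and (issuing_unit is None or e.issuing_unit == issuing_unit)
    ]
    # Category is the primary sort key, entry time the secondary one.
    return sorted(hits, key=lambda e: (e.file_category, e.entered_at))
```

Calling `query_policies(entries)` with no conditions returns every entry, which corresponds to the default state of the list.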
Policies and regulations list:
Default state: all entered policies and regulations are shown as entries;
Entry attributes: file name, file category, region level, issuing unit, entry time, and operation buttons (1. Modify: click to jump to the create/modify page; 2. Delete: click to open a pop-up asking to confirm the deletion);
Sorting: by category first, then by entry time;
Clicking the "New" button jumps to the create/modify page. For the create or modify page, see Fig. 8. As shown in Fig. 8, the attributes on the create or modify page are: file name, validity date, original-text link, file category, region level, and issuing unit. The required attributes are file name, file category, region level, and issuing unit; the optional attributes are validity date and original-text link.
If a required attribute is not filled in, clicking "Save" opens a pop-up prompt: the information is incomplete!
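The required-field check performed on "Save" can be sketched as below. The field keys are hypothetical names for the attributes listed above: file name, file category, region level, and issuing unit are required, while validity date and original-text link are optional.

```python
from typing import Dict, List

REQUIRED_FIELDS = ("file_name", "file_category", "region_level", "issuing_unit")
OPTIONAL_FIELDS = ("valid_date", "original_url")


def missing_required(form: Dict[str, str]) -> List[str]:
    """Return the required fields that are absent or blank in the form."""
    return [name for name in REQUIRED_FIELDS if not form.get(name, "").strip()]


def can_save(form: Dict[str, str]) -> bool:
    """A form may be saved only when no required field is missing."""
    return not missing_required(form)
```

When `can_save` returns `False`, the page would show the incomplete-information prompt; listing the missing fields by name is an optional refinement that the source does not describe.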
The create or modify page includes the following tools:
a. a text editing toolbar;
b. a text input box;
c. attachment adding: clicking "Attachment" opens a pop-up for selecting a local file and uploading it.
Clicking the "Save" button saves the file and automatically records the "last modified time (date, hour, minute, second)" as a text attribute; on the policies-and-regulations summary page it is shown as the entry attribute "entry time" (date only).
Delete
On the policies-and-regulations summary page, clicking the entry's "Delete" operation button opens a pop-up asking to confirm the deletion.
In this embodiment, the data tables that the policies and regulations management subsystem 133 needs to maintain may include the following:
CREATE TABLE `Policy` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `fileName` varchar(500) NOT NULL DEFAULT '' COMMENT 'File name',
  `fileCategoryId` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'File category id',
  `departmentId` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Issuing unit id',
  `regionId` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Region level id',
  `validStartTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Validity start time',
  `validEndTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Validity end time',
  `originalUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'Original-text link',
  `cdnUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'CDN file link',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'Operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'Deleted flag (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Information entry data table';
CREATE TABLE `PolicyDepartment` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `department` varchar(100) NOT NULL DEFAULT '' COMMENT 'Issuing unit',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'Deleted flag (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Issuing unit management table';
CREATE TABLE `PolicyRegion` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `region` varchar(50) NOT NULL DEFAULT '' COMMENT 'Region level',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'Deleted flag (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Region level management table';
CREATE TABLE `PolicyFileCategory` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `fileCategory` varchar(100) NOT NULL DEFAULT '' COMMENT 'File category',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'Deleted flag (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='File category management table';
In another embodiment of this application, another management backend 13 is provided. Referring to Fig. 9, on the basis of the management backend 13 shown in Fig. 6, it may further include:
a risk report management subsystem 134, configured to enter risk reports and to provide a risk report search and display interface, so as to receive a risk report search request in that interface, respond to the request by querying the database for risk reports, and display them. A risk report is a report that assesses business risk on the basis of the policies and regulations.
For the risk report search and display interface, see Fig. 10. As shown in Fig. 10, the interface includes:
Query conditions: file name, corresponding policies and regulations, and risk level;
If none of the query conditions is filled in, the list is shown in its default state;
If any of the query conditions are filled in, clicking "Query" shows the query result; the file name is matched fuzzily.
Risk report list:
Default state: all entered risk reports are shown as entries;
Entry attributes: file name, corresponding policies and regulations, risk level, entry time, and operation buttons (1. Modify: click to jump to the create/modify page; 2. Delete: click to open a pop-up asking to confirm the deletion);
Sorting: by the attribute "corresponding policies and regulations" first, then by entry time.
Clicking the "New" button jumps to the create/modify page. For the create or modify page, see Fig. 11. As shown in Fig. 11, the create or modify page includes the attributes file name, corresponding policies and regulations, and risk level, all of which are required.
If a required attribute is not filled in, clicking "Save" opens a pop-up prompt: the information is incomplete!
The create or modify page further includes the following tools:
a text editing toolbar;
a text input box;
attachment adding: clicking "Attachment" opens a pop-up for selecting a local file and uploading it;
Clicking the "Save" button saves the file and automatically records the "last modified time (date, hour, minute, second)" as a text attribute; on the risk report summary page it is shown as the entry attribute "entry time" (date only).
Delete
On the risk report summary page, clicking the entry's "Delete" operation button opens a pop-up asking to confirm the deletion.
The data tables that the risk report management subsystem needs to maintain may include the following:
CREATE TABLE `RiskReport` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `fileName` varchar(500) NOT NULL DEFAULT '' COMMENT 'File name',
  `policyId` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Corresponding policies and regulations id',
  `riskLevel` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Risk level (0: none, 1: low, 2: medium, 3: high)',
  `cdnUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'CDN file link',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'Operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'Update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'Deleted flag (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='Risk report data table';
In another embodiment of this application, another management backend 13 is provided. Referring to Fig. 12, on the basis of the management backend 13 shown in Fig. 9, it may further include:
a risk entry management subsystem 135, configured to enter risk entries and to provide a risk entry search and display interface, so as to receive a risk entry search request in that interface, respond to the request by searching the database for risk entries, and display them. A risk entry is an entry carrying business risk that is extracted, on the basis of the policies and regulations, from the web pages crawled by the crawler backend 11.
For the initial interface for entering risk entries, see Fig. 13. As shown in Fig. 13: clicking Delete opens a dialog asking to confirm the deletion; clicking Add directly adds a blank row in which the relevant items can be entered; clicking Save saves the added or modified content; clicking a content item allows the content to be maintained. For the content maintenance interface, see Fig. 14. As shown in Fig. 14: clicking Delete opens a dialog asking to confirm the deletion; clicking Add directly adds a blank row in which the relevant items can be entered; clicking Save saves the added or modified content; clicking Edit Case allows a case to be edited. For the case editing interface, see Fig. 15.
For the risk entry search and display interface, see Fig. 16. Clicking the Query button in Fig. 16 queries a specific risk entry. Clicking the specific content of a risk entry in Fig. 16 shows the content details, as shown in Fig. 17; clicking Case shows the case content, as shown in Fig. 18.
The data tables that the risk entry management subsystem needs to maintain may include the following:
CREATE TABLE `RiskWordCategory` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(20) NOT NULL DEFAULT '' COMMENT 'category name',
  `description` varchar(200) NOT NULL DEFAULT '0' COMMENT 'category description',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0-normal, 1-deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='risk dictionary category table';
CREATE TABLE `RiskWordProperty` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(20) NOT NULL DEFAULT '' COMMENT 'attribute name',
  `description` varchar(200) NOT NULL DEFAULT '0' COMMENT 'attribute description',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0-normal, 1-deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='risk dictionary attribute table';
CREATE TABLE `RiskWord` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `word` varchar(200) NOT NULL DEFAULT '0' COMMENT 'word name',
  `categoryId` int(11) NOT NULL DEFAULT '0' COMMENT 'dictionary category id',
  `propertyId` int(11) NOT NULL DEFAULT '0' COMMENT 'dictionary attribute id',
  `subCategory` varchar(200) NOT NULL DEFAULT '0' COMMENT 'subcategory',
  `basis` varchar(200) NOT NULL DEFAULT '0' COMMENT 'basis',
  `ruleValidStartTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'regulation effective time',
  `ruleValidEndTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'regulation expiry time',
  `validStartTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'entry effective time',
  `validEndTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'entry expiry time',
  `expireReason` varchar(500) NOT NULL DEFAULT '0' COMMENT 'expiry reason',
  `cdnUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'CDN case file link',
  `extra` varchar(1000) NOT NULL DEFAULT '' COMMENT 'additional information',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0-normal, 1-deleted)',
  PRIMARY KEY (`id`),
  KEY `idx_categoryId` (`categoryId`),
  KEY `idx_propertyId` (`propertyId`),
  KEY `idx_subCategory` (`subCategory`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='risk dictionary word table';
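A risk entry search against the `RiskWord` table could look roughly like the sketch below. It is illustrative only and rests on several assumptions that the source does not state: the search matches a keyword as a substring of `word`, `validStartTime` and `validEndTime` are Unix timestamps, and a `validEndTime` of 0 means the entry has no expiry. SQLite stands in for MySQL, only the columns needed for the query are reproduced, and `search_risk_words` is a hypothetical name.

```python
import sqlite3

# SQLite stand-in for the MySQL RiskWord table, reduced to the queried columns.
RISK_WORD_SCHEMA = """
CREATE TABLE RiskWord (
    id             INTEGER PRIMARY KEY AUTOINCREMENT,
    word           TEXT    NOT NULL DEFAULT '0',
    categoryId     INTEGER NOT NULL DEFAULT 0,
    validStartTime INTEGER NOT NULL DEFAULT 0,
    validEndTime   INTEGER NOT NULL DEFAULT 0,
    isDeleted      INTEGER NOT NULL DEFAULT 0
);
CREATE INDEX idx_categoryId ON RiskWord (categoryId);
"""


def search_risk_words(conn, keyword="", category_id=None, at_time=None):
    """Return (id, word, categoryId) rows for live entries matching the filters.

    Every filter is optional. Values are always passed as bound parameters,
    never formatted into the SQL text.
    """
    sql = "SELECT id, word, categoryId FROM RiskWord WHERE isDeleted = 0"
    params = []
    if keyword:
        # Substring match on the entry text.
        sql += " AND instr(word, ?) > 0"
        params.append(keyword)
    if category_id is not None:
        sql += " AND categoryId = ?"
        params.append(category_id)
    if at_time is not None:
        # Assumption: validEndTime = 0 means "never expires".
        sql += " AND validStartTime <= ? AND (validEndTime = 0 OR validEndTime > ?)"
        params.extend([at_time, at_time])
    sql += " ORDER BY id"
    return conn.execute(sql, params).fetchall()
```

The `categoryId` filter is the one that benefits from the `idx_categoryId` index declared in the original table; the substring match on `word` still scans the candidate rows.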
In another embodiment of the present application, another management backstage 13 is provided. Referring to Figure 19, on the basis of the management backstage 13 shown in Figure 12, it may further include:
A crawler result report display subsystem 136, configured to display the classification parsing results stored in the database and the webpages grabbed by the crawler backstage 11.
The interface that displays the classification parsing results stored in the database and the webpages grabbed by the crawler backstage 11 is shown in Figure 20. As shown in Figure 20, clicking a report shows its details; the detail display interface is shown in Figure 21.
The crawler method provided by the present application is introduced next. The crawler method described below and the crawler system described above may be referred to in correspondence with each other.
The crawler method may comprise the following steps:
A11: grabbing webpages from the network according to a preset crawl target;
A12: performing classification parsing on the webpages grabbed by the crawler backstage to obtain a classification parsing result, and storing the classification parsing result and the webpages grabbed by the crawler backstage in a database;
A13: managing the classification parsing result stored in the database and the webpages grabbed by the crawler backstage.
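The three steps A11 to A13 can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the method as implemented by the applicant: the database is an in-memory object, page fetching is an injected function so that no network access is needed, classification is a simple keyword match, and "management" is reduced to one query by category. All names (`Database`, `grab_pages`, `keyword_classifier`, `classify_and_store`, `list_by_category`) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Database:
    """In-memory stand-in for the database shared by the three steps."""
    pages: Dict[str, str] = field(default_factory=dict)
    classifications: Dict[str, str] = field(default_factory=dict)


def grab_pages(targets: Iterable[str], fetch: Callable[[str], str]) -> Dict[str, str]:
    """A11: grab one page per preset target URL using the supplied fetch function."""
    return {url: fetch(url) for url in targets}


def keyword_classifier(rules: Dict[str, List[str]]) -> Callable[[str], str]:
    """Build a classifier that returns the first category whose keyword occurs in the text."""
    def classify(text: str) -> str:
        lowered = text.lower()
        for category, words in rules.items():
            if any(word.lower() in lowered for word in words):
                return category
        return "other"
    return classify


def classify_and_store(pages: Dict[str, str],
                       classify: Callable[[str], str],
                       db: Database) -> None:
    """A12: classify each grabbed page and store both the page and its result."""
    for url, text in pages.items():
        db.pages[url] = text
        db.classifications[url] = classify(text)


def list_by_category(db: Database, category: str) -> List[str]:
    """A13 (one management operation): list stored URLs in a given category."""
    return sorted(url for url, cat in db.classifications.items() if cat == category)
```

Passing `fetch` and `classify` in as functions keeps the three steps independent, which mirrors the separation between the crawler backstage, the real-time processing system, and the management backstage in the system embodiments.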
It should be noted that each embodiment focuses on its differences from the other embodiments; for the parts that are the same or similar, the embodiments may be referred to one another. Because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the description of the method embodiments.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.
For convenience of description, the above apparatus is described in terms of separate units divided by function. Of course, when implementing the present application, the functions of the units may be implemented in one or more pieces of software and/or hardware.
From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application or in certain parts of the embodiments.
The crawler system and method provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the contents of this specification should not be construed as limiting the present application.

Claims (10)

1. A crawler system, characterized by comprising:
a crawler backstage, configured to grab webpages from a network according to a preset crawl target;
a crawler report real-time processing system, configured to perform classification parsing on the webpages grabbed by the crawler backstage to obtain a classification parsing result, and to store the classification parsing result and the webpages grabbed by the crawler backstage in a database;
a management backstage, configured to manage the classification parsing result stored in the database and the webpages grabbed by the crawler backstage.
2. The crawler system according to claim 1, characterized in that the crawler backstage is specifically configured to:
select, according to a web page analysis algorithm, the website links (URLs) relevant to the preset crawl target, and put the URLs relevant to the preset crawl target into a queue of URLs to be grabbed;
choose a URL from the queue of URLs to be grabbed according to a search strategy as a target URL, and grab a webpage from the network according to the target URL;
judge whether a set grabbing condition has been reached;
if it has been reached, end the grabbing;
if it has not been reached, return to the step of selecting, according to the web page analysis algorithm, the website links (URLs) relevant to the preset crawl target, until the set grabbing condition is reached.
3. The crawler system according to claim 1, characterized in that the crawler system further comprises:
a distributed message middleware, configured to pass the webpages grabbed by the crawler backstage to the crawler report real-time processing system.
4. The crawler system according to claim 1, characterized in that the management backstage comprises:
a monitoring alarm subsystem, configured to monitor whether the update of the webpages grabbed by the crawler backstage and stored in the database is abnormal;
and, if an abnormality exists, to raise an abnormality alarm and notify the alarm recipient.
5. The crawler system according to claim 4, characterized in that the management backstage further comprises:
a crawler channel and report management subsystem, configured to periodically retrieve the content of specified columns of specified websites in the webpages grabbed by the crawler backstage and stored in the database to obtain a retrieval result, and, according to the retrieval result, to feed back to an administrator the websites whose column content has changed.
6. The crawler system according to claim 5, characterized in that the management backstage further comprises:
a policies and regulations management subsystem, configured to edit, according to a unified format, the policies and regulations selected from the webpages grabbed by the crawler backstage and stored in the database to obtain editable text, to upload the editable text to the database, and to provide a policies and regulations query interface through which the editable text can be searched and displayed.
7. The crawler system according to claim 6, characterized in that the management backstage further comprises:
a risk report management subsystem, configured to enter risk reports and to provide a risk report search and display interface; to receive a risk report search request at the risk report search and display interface; and, in response to the risk report search request, to query the database for risk reports and display them, wherein a risk report is a report of a risk assessment of the business carried out based on the policies and regulations.
8. The crawler system according to claim 7, characterized in that the management backstage further comprises:
a risk entry management subsystem, configured to enter risk entries and to provide a risk entry search and display interface; to receive a risk entry search request at the risk entry search and display interface; and, in response to the risk entry search request, to search the database for risk entries and display them, wherein a risk entry is an entry carrying business risk that is extracted, based on the policies and regulations, from the webpages grabbed by the crawler backstage.
9. The crawler system according to claim 8, characterized in that the management backstage further comprises:
a crawler result report display subsystem, configured to display the classification parsing result stored in the database and the webpages grabbed by the crawler backstage.
10. A crawler method, characterized by comprising:
grabbing webpages from a network according to a preset crawl target;
performing classification parsing on the webpages grabbed by the crawler backstage to obtain a classification parsing result, and storing the classification parsing result and the webpages grabbed by the crawler backstage in a database;
managing the classification parsing result stored in the database and the webpages grabbed by the crawler backstage.
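The grabbing loop recited in claim 2 can be illustrated with the following sketch. It is one possible reading, not the claimed implementation, and every concrete choice in it is an assumption: the "web page analysis algorithm" is reduced to a keyword match on each link's URL and anchor text, the "search strategy" is first-in-first-out (breadth-first), the "set grabbing condition" is a maximum page count, and the seed URLs are taken to be already relevant. Fetching is an injected function, and the names `LinkExtractor`, `relevant_links`, and `crawl` are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Dict, List, Optional, Tuple
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects (href, anchor text) pairs from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def relevant_links(base_url: str, html: str, keywords: List[str]) -> List[str]:
    """Keep links whose URL or anchor text mentions a keyword (assumed relevance test)."""
    parser = LinkExtractor()
    parser.feed(html)
    selected = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        if any(keyword.lower() in haystack for keyword in keywords):
            selected.append(urljoin(base_url, href))
    return selected


def crawl(seed_urls: List[str], fetch: Callable[[str], str],
          keywords: List[str], max_pages: int) -> Dict[str, str]:
    """Grab pages until the queue is empty or max_pages pages have been grabbed."""
    queue = deque(seed_urls)      # queue of URLs to be grabbed
    seen = set(seed_urls)         # avoids queuing the same URL twice
    pages: Dict[str, str] = {}
    while queue and len(pages) < max_pages:   # the set grabbing condition
        url = queue.popleft()                 # FIFO search strategy
        html = fetch(url)
        pages[url] = html
        for link in relevant_links(url, html, keywords):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages
```

Swapping `deque.popleft` for a priority queue ordered by a relevance score would turn the breadth-first strategy into a best-first one without changing the rest of the loop.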
CN201910807818.7A 2019-08-29 2019-08-29 A kind of crawler system and method Pending CN110516135A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910807818.7A CN110516135A (en) 2019-08-29 2019-08-29 A kind of crawler system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910807818.7A CN110516135A (en) 2019-08-29 2019-08-29 A kind of crawler system and method

Publications (1)

Publication Number Publication Date
CN110516135A true CN110516135A (en) 2019-11-29

Family

ID=68627865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910807818.7A Pending CN110516135A (en) 2019-08-29 2019-08-29 A kind of crawler system and method

Country Status (1)

Country Link
CN (1) CN110516135A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113094382A (en) * 2021-04-02 2021-07-09 南开大学 Semi-automatic data acquisition and updating method for multi-source data management

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102339292A (en) * 2010-07-27 2012-02-01 中国电信股份有限公司 Distributed searching method and system
CN103488750A (en) * 2013-09-24 2014-01-01 长沙裕邦软件开发有限公司 Implementation method and system of network robot
CN105608134A (en) * 2015-12-18 2016-05-25 盐城工学院 Multithreading-based web crawler system and web crawling method thereof
CN109063144A (en) * 2018-08-07 2018-12-21 广州金猫信息技术服务有限公司 Visual network crawler method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20191129)