CN110516135A - A kind of crawler system and method - Google Patents
- Publication number: CN110516135A
- Application number: CN201910807818.7A
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
This application provides a crawler system and method. The crawler system includes: a crawler backstage, configured to grab webpages from the network according to a preset crawl target; a crawler report real-time processing system, configured to perform classification parsing on the webpages grabbed by the crawler backstage to obtain a classification parsing result, and to store the classification parsing result and the grabbed webpages in a database; and a management backstage, configured to manage the classification parsing result and the grabbed webpages stored in the database. In this way, the application can improve the practicability of the crawler system.
Description
Technical field
This application relates to the technical field of information processing, and in particular to a crawler system and method.
Background
A web crawler system is a system that automatically grabs webpages from the network. The grabbed webpages are typically supplied to a third party (e.g., a search engine) for use.
At present, however, the function of a web crawler system is limited to grabbing webpages. The system has a single function, and its practicability is low.
Summary of the invention
To solve the above technical problem, the embodiments of the present application provide a crawler system and method, so as to improve the practicability of the crawler system. The technical solution is as follows:
A crawler system, comprising:
a crawler backstage, configured to grab webpages from the network according to a preset crawl target;
a crawler report real-time processing system, configured to perform classification parsing on the webpages grabbed by the crawler backstage to obtain a classification parsing result, and to store the classification parsing result and the webpages grabbed by the crawler backstage in a database; and
a management backstage, configured to manage the classification parsing result and the webpages grabbed by the crawler backstage that are stored in the database.
Preferably, the crawler backstage is specifically configured to:
select, according to a web page analysis algorithm, the URLs (website links) relevant to the preset crawl target, and store those URLs in a queue of URLs to be grabbed;
choose a URL from the queue of URLs to be grabbed according to a search strategy, take it as the target URL, and grab a webpage from the network according to the target URL;
judge whether the set crawl condition has been reached;
if it has been reached, end the grab;
if it has not been reached, return to the step of selecting, according to the web page analysis algorithm, the URLs relevant to the preset crawl target, until the set crawl condition is reached.
Preferably, the crawler system further comprises:
a distributed message middleware, configured to pass the webpages grabbed by the crawler backstage to the crawler report real-time processing system.
Preferably, the management backstage comprises:
a monitoring alarm subsystem, configured to monitor whether the updates of the webpages grabbed by the crawler backstage and stored in the database are abnormal;
if an abnormality exists, an abnormality alarm is raised and the alarm recipient is notified.
Preferably, the management backstage further comprises:
a crawler channel and report management subsystem, configured to periodically retrieve the content of specified columns of specified websites from the webpages grabbed by the crawler backstage and stored in the database, obtain a retrieval result, and, according to the retrieval result, feed back to an administrator the websites whose column content has changed.
Preferably, the management backstage further comprises:
a policies and regulations management subsystem, configured to edit, according to a unified format, the policies and regulations selected from the webpages grabbed by the crawler backstage and stored in the database, so as to obtain an editable text; to upload the editable text to the database; and to provide a policies and regulations query interface on which the editable text is searched and displayed.
Preferably, the management backstage further comprises:
a risk report management subsystem, configured to enter risk reports and to provide a risk report search and display interface, so as to receive a risk report search request on that interface, respond to the request by querying the database for risk reports, and display them. A risk report is a report that assesses the risk to the business on the basis of the policies and regulations.
Preferably, the management backstage further comprises:
a risk entry management subsystem, configured to enter risk entries and to provide a risk entry search and display interface, so as to receive a risk entry search request on that interface, respond to the request by searching the database for risk entries, and display them. A risk entry is an entry carrying business risk that is extracted, on the basis of the policies and regulations, from the webpages grabbed by the crawler backstage.
Preferably, the management backstage further comprises:
a crawler result report display subsystem, configured to display the classification parsing result and the webpages grabbed by the crawler backstage that are stored in the database.
A crawler method, comprising:
grabbing webpages from the network according to a preset crawl target;
performing classification parsing on the webpages grabbed by the crawler backstage to obtain a classification parsing result, and storing the classification parsing result and the webpages grabbed by the crawler backstage in a database;
managing the classification parsing result and the webpages grabbed by the crawler backstage that are stored in the database.
Compared with the prior art, the beneficial effects of the present application are as follows:
The present application provides a crawler system that includes a crawler backstage, a crawler report real-time processing system and a management backstage. The crawler backstage grabs webpages from the network according to a preset crawl target, which realizes the webpage grabbing function; the crawler report real-time processing system realizes the electronic collection of information; and the management backstage realizes information management. Compared with a traditional crawler system, which can only grab webpages, the crawler system of the present application has extended functions, so its practicability can be improved.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the logical structure of a crawler system provided by the present application;
Fig. 2 is a schematic diagram of the logical structure of another crawler system provided by the present application;
Fig. 3 is a schematic diagram of the logical structure of another crawler system provided by the present application;
Fig. 4 is a schematic diagram of the logical structure of another crawler system provided by the present application;
Fig. 5 is a schematic diagram of an interface for entering valid information provided by the present application;
Fig. 6 is a schematic diagram of the logical structure of another crawler system provided by the present application;
Fig. 7 is a schematic diagram of a policies and regulations query interface provided by the present application;
Fig. 8 is a schematic diagram of a new or modify page provided by the present application;
Fig. 9 is a schematic diagram of the logical structure of another crawler system provided by the present application;
Fig. 10 is a schematic diagram of a risk report search and display interface provided by the present application;
Fig. 11 is a schematic diagram of another new or modify page provided by the present application;
Fig. 12 is a schematic diagram of the logical structure of another crawler system provided by the present application;
Fig. 13 is a schematic diagram of the initial interface for entering risk entries provided by the present application;
Fig. 14 is a schematic diagram of a content maintenance interface provided by the present application;
Fig. 15 is a schematic diagram of a case editing interface provided by the present application;
Fig. 16 is a schematic diagram of a risk entry search and display interface provided by the present application;
Fig. 17 is a schematic diagram of a display interface for the content details of a risk entry provided by the present application;
Fig. 18 is a schematic diagram of a display interface for case content provided by the present application;
Fig. 19 is a schematic diagram of the logical structure of another crawler system provided by the present application;
Fig. 20 is a schematic diagram of a webpage display interface provided by the present application;
Fig. 21 is a schematic diagram of a display interface for details provided by the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments in the present application, without creative effort, fall within the protection scope of the present application.
The embodiments of the present application disclose a crawler system, comprising: a crawler backstage, configured to grab webpages from the network according to a preset crawl target; a crawler report real-time processing system, configured to perform classification parsing on the webpages grabbed by the crawler backstage to obtain a classification parsing result, and to store the classification parsing result and the webpages grabbed by the crawler backstage in a database; and a management backstage, configured to manage the classification parsing result and the webpages grabbed by the crawler backstage that are stored in the database. In the present application, the functions of the crawler system are extended and its practicability is improved.
The crawler system disclosed in the embodiments of the present application is introduced next. As shown in Fig. 1, the crawler system includes: crawler backstage 11, crawler report real-time processing system 12 and management backstage 13.
Crawler backstage 11 is configured to grab webpages from the network according to a preset crawl target.
Preferably, the process by which crawler backstage 11 grabs webpages from the network according to the preset crawl target may include:
A11: select, according to a web page analysis algorithm, the URLs (website links) relevant to the preset crawl target, and store those URLs in a queue of URLs to be grabbed;
A12: choose a URL from the queue of URLs to be grabbed according to a search strategy, take it as the target URL, and grab a webpage from the network according to the target URL;
A13: judge whether the set crawl condition has been reached;
if it has been reached, execute step A14; if it has not been reached, execute step A15.
A14: end the grab.
A15: return to the step of selecting, according to the web page analysis algorithm, the URLs relevant to the preset crawl target, until the set crawl condition is reached.
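Steps A11 to A15 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `focused_crawl`, `fetch`, `extract_links`, `is_relevant` and `max_pages` are hypothetical, the web page analysis algorithm is reduced to a caller-supplied predicate, the search strategy is assumed to be first-in first-out, and the set crawl condition is assumed to be a page budget.

```python
from collections import deque
from typing import Callable, Iterable


def focused_crawl(
    seeds: Iterable[str],
    fetch: Callable[[str], str],
    extract_links: Callable[[str], Iterable[str]],
    is_relevant: Callable[[str], bool],
    max_pages: int,
) -> dict[str, str]:
    """Grab pages until max_pages are stored or the queue runs empty."""
    # A11: keep only URLs relevant to the crawl target and enqueue them.
    queue = deque(url for url in seeds if is_relevant(url))
    seen = set(queue)
    pages: dict[str, str] = {}
    # A13: the assumed crawl condition is a simple page budget.
    while queue and len(pages) < max_pages:
        url = queue.popleft()          # A12: FIFO search strategy (assumption)
        html = fetch(url)
        pages[url] = html
        # A15: go back to A11 for the links found on the grabbed page.
        for link in extract_links(html):
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return pages                       # A14: end the grab
```

Passing `fetch` and `extract_links` in as functions keeps the loop independent of any particular HTTP client or HTML parser, so the control flow of the five steps stays visible.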
Of course, the process by which crawler backstage 11 grabs webpages from the network according to the preset crawl target may also include:
starting from the URLs of one or several initial webpages, obtaining the URLs on the initial webpages and, while grabbing webpages, continuously extracting new URLs from the current page and putting them into the queue, until a certain stop condition of the system is met.
In this embodiment, after grabbing a webpage, crawler backstage 11 edits the grabbed webpage according to a unified format and sends the edited webpage content to crawler report real-time processing system 12.
For the unified format, see Table 1.
Table 1
The data tables that crawler backstage 11 needs to maintain may be as follows:
CREATE TABLE `SpiderSite` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `platform` varchar(100) NOT NULL DEFAULT '' COMMENT 'crawled website type (PC: PC site; WeChatPublicNumber: WeChat official account)',
  `siteName` varchar(500) NOT NULL DEFAULT '' COMMENT 'crawled website name',
  `domain` varchar(500) NOT NULL DEFAULT '' COMMENT 'crawled website domain name',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='crawler channel management table';
CREATE TABLE `SpiderReport` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `spiderId` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'crawler id',
  `msgType` tinyint(4) unsigned NOT NULL DEFAULT '0' COMMENT 'message type (1: updated, 2: not updated, 3: update abnormal)',
  `platform` varchar(100) NOT NULL DEFAULT '' COMMENT 'crawled website type (PC: PC site; WeChatPublicNumber: WeChat official account)',
  `siteName` varchar(500) NOT NULL DEFAULT '' COMMENT 'crawled website name',
  `domain` varchar(500) NOT NULL DEFAULT '' COMMENT 'crawled website domain name',
  `statisticsTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'crawl time',
  `subjectTitle` varchar(1000) NOT NULL DEFAULT '' COMMENT 'crawled article title',
  `subjectUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'crawled article link',
  `subjectDate` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'crawled article publication date',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0: normal, 1: deleted)',
  PRIMARY KEY (`id`),
  KEY `idx_msgType` (`msgType`),
  KEY `idx_platform` (`platform`),
  KEY `idx_domain` (`domain`),
  KEY `idx_subjectDate` (`subjectDate`),
  KEY `idx_created` (`created`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='crawler result data table';
Crawler report real-time processing system 12 is configured to perform classification parsing on the webpages grabbed by crawler backstage 11 to obtain a classification parsing result, and to store the classification parsing result and the webpages grabbed by crawler backstage 11 in the database.
Performing classification parsing on the webpages grabbed by crawler backstage 11 can be understood as classifying and parsing the update attribute of those webpages. The result of the classification parsing may be: updated, not updated, or update abnormal.
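The three possible results can be illustrated with a short Python sketch. The source does not say how the update attribute is determined, so everything here is an assumption made for illustration: the function name `classify_update` is hypothetical, a change is detected by comparing a SHA-256 digest of the page content with the digest of the previous grab, and a failed grab (content `None`) is treated as an abnormal update. The numeric codes follow one reading of the `msgType` column of the `SpiderReport` table.

```python
import hashlib
from typing import Optional

# Codes as read from the msgType column comment (interpretation, not confirmed).
HAS_UPDATE, NO_UPDATE, UPDATE_ABNORMAL = 1, 2, 3


def classify_update(
    previous_digest: Optional[str], content: Optional[str]
) -> tuple[int, Optional[str]]:
    """Return (classification code, digest to remember for the next grab)."""
    if content is None:
        # Assumption: a failed grab counts as an abnormal update.
        return UPDATE_ABNORMAL, previous_digest
    digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
    if digest == previous_digest:
        return NO_UPDATE, digest
    return HAS_UPDATE, digest
```

A first grab has no previous digest and is therefore reported as updated; whether that is the desired behaviour depends on the deployment.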
Management backstage 13 is configured to manage the classification parsing result and the webpages grabbed by crawler backstage 11 that are stored in the database.
As another optional embodiment of the present application, Fig. 2 is a schematic structural diagram of embodiment 2 of a crawler system provided by the present application. This embodiment mainly extends the crawler system described in embodiment 1 above. As shown in Fig. 2, on the basis of the crawler system shown in Fig. 1, the crawler system may further include:
Distributed message middleware 14, configured to pass the webpages grabbed by crawler backstage 11 to crawler report real-time processing system 12.
Using distributed message middleware 14 to pass the webpages grabbed by crawler backstage 11 to crawler report real-time processing system 12 can improve the efficiency of message delivery.
Distributed message middleware 14 is a distributed, highly reliable message queue that supports multiple platforms and multiple scenarios and is developed on the basis of Kafka/RocketMQ. Distributed message middleware 14 is developed in PHP and provides support for three platforms: PHP, C++ and Java.
Compared with previous distributed message systems, the features of distributed message middleware 14 provided in this embodiment are:
1. Multi-platform support: support is provided for three platforms, PHP, C++ and Java.
2. Message retention: distributed message middleware 14 can store every message it receives, which means that a consumer can consume the same batch of messages repeatedly. This can be achieved through a specific API or operation.
3. Multi-scenario support: it can be used generally across many scenarios, which saves technical development cost.
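The repeated-consumption property in feature 2 can be illustrated with a small in-memory sketch. It is not the API of distributed message middleware 14, Kafka or RocketMQ; the class `MessageLog` and its methods `publish`, `poll` and `seek` are hypothetical names. The assumed idea is that messages stay in an append-only log and each consumer only keeps a read offset, so rewinding the offset replays the same batch.

```python
class MessageLog:
    """Append-only log: reading a message does not remove it."""

    def __init__(self) -> None:
        self._messages: list[str] = []
        self._offsets: dict[str, int] = {}  # consumer name -> next index to read

    def publish(self, message: str) -> int:
        """Store a message and return its offset in the log."""
        self._messages.append(message)
        return len(self._messages) - 1

    def poll(self, consumer: str, max_messages: int = 10) -> list[str]:
        """Return the next batch for this consumer and advance its offset."""
        start = self._offsets.get(consumer, 0)
        batch = self._messages[start:start + max_messages]
        self._offsets[consumer] = start + len(batch)
        return batch

    def seek(self, consumer: str, offset: int) -> None:
        """Rewind (or skip) so that earlier messages can be consumed again."""
        self._offsets[consumer] = offset
```

For example, after the report processing side has polled two grabbed pages, calling `seek("report", 0)` lets it poll the same two pages again.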
In another embodiment of the present application, management backstage 13 is introduced. Referring to Fig. 3, management backstage 13 may include monitoring alarm subsystem 131.
Monitoring alarm subsystem 131 is configured to monitor whether the updates of the webpages grabbed by crawler backstage 11 and stored in the database are abnormal;
if an abnormality exists, an abnormality alarm is raised and the alarm recipient is notified.
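The monitoring rule can be sketched as a scan over stored crawl reports. This is an illustration under assumptions: the function name `alert_on_abnormal` and the `notify` callback are hypothetical, each report is assumed to be a mapping with the `msgType`, `siteName` and `domain` fields of the `SpiderReport` table, and the value 3 is assumed to mean an abnormal update.

```python
from typing import Callable, Iterable, Mapping

UPDATE_ABNORMAL = 3  # assumed meaning of msgType = 3


def alert_on_abnormal(
    reports: Iterable[Mapping[str, object]],
    recipients: Iterable[str],
    notify: Callable[[str, str], None],
) -> int:
    """Notify every recipient about each abnormal report; return the alarm count."""
    recipient_list = list(recipients)
    alarms = 0
    for report in reports:
        if report["msgType"] == UPDATE_ABNORMAL:
            alarms += 1
            text = f"Abnormal update: {report['siteName']} ({report['domain']})"
            for recipient in recipient_list:
                notify(recipient, text)
    return alarms
```

The delivery channel (mail, SMS, chat message) is left to the `notify` callback, since the source does not specify how the alarm recipient is notified.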
In another embodiment of the present application, another management backstage 13 is provided. Referring to Fig. 4, on the basis of management backstage 13 shown in Fig. 3, it may further include crawler channel and report management subsystem 132.
Crawler channel and report management subsystem 132 is configured to periodically retrieve the content of specified columns of specified websites from the webpages grabbed by crawler backstage 11 and stored in the database, obtain a retrieval result, and, according to the retrieval result, feed back to the administrator the websites whose column content has changed.
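One way to picture the change detection is to compare two successive retrievals of the specified column content. The sketch below assumes that each retrieval is available as a mapping from website to column content; the function name `changed_sites` and this data shape are hypothetical, and the periodic scheduling itself is left out.

```python
def changed_sites(previous: dict[str, str], current: dict[str, str]) -> list[str]:
    """Return the websites whose column content differs from the previous retrieval.

    A website that appears for the first time is also reported (assumption).
    """
    return sorted(
        site for site, content in current.items() if previous.get(site) != content
    )
```

The returned list is what would be fed back to the administrator for manual screening.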
After the websites whose column content has changed are fed back to the administrator according to the retrieval result, the administrator can screen them manually and enter the valid information into the database.
The interface on which the administrator enters valid information is shown in Fig. 5. As shown in Fig. 5, the input elements may include a channel URL and a channel title.
Clicking "Add" saves the channel URL and channel title to the list.
Clicking "Query" performs fuzzy matching in the database on the content of the channel URL and channel title, and displays the result in the channel list. In its initial state, the channel list displays all URL content by default.
Each item in the channel list can be maintained by deleting or modifying it.
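The fuzzy query on channel URL and channel title can be sketched with an SQL `LIKE` pattern. The example uses Python's built-in `sqlite3` module and an in-memory table purely for illustration; the source's schema targets MySQL, and mapping the channel URL to the `domain` column and the channel title to the `siteName` column of `SpiderSite` is an assumption. The function names are hypothetical.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Build a tiny in-memory stand-in for the SpiderSite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE SpiderSite ("
        "id INTEGER PRIMARY KEY, siteName TEXT, domain TEXT, "
        "isDeleted INTEGER NOT NULL DEFAULT 0)"
    )
    conn.executemany(
        "INSERT INTO SpiderSite (siteName, domain) VALUES (?, ?)",
        [("Example News", "news.example.com"), ("Sample Gazette", "gazette.example.org")],
    )
    return conn


def search_channels(conn: sqlite3.Connection, keyword: str = "") -> list[tuple]:
    """Fuzzy-match the keyword against channel URL and title.

    An empty keyword matches everything, which mirrors the default state
    of the channel list. '%' or '_' in the keyword act as wildcards here.
    """
    pattern = f"%{keyword}%"
    return conn.execute(
        "SELECT domain, siteName FROM SpiderSite "
        "WHERE isDeleted = 0 AND (domain LIKE ? OR siteName LIKE ?) ORDER BY id",
        (pattern, pattern),
    ).fetchall()
```

Binding the pattern as a parameter, rather than formatting it into the SQL string, keeps user input out of the statement text.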
In another embodiment of the present application, another management backstage 13 is provided. Referring to Fig. 6, on the basis of management backstage 13 shown in Fig. 4, it may further include:
Policies and regulations management subsystem 133, configured to edit, according to a unified format, the policies and regulations selected from the webpages grabbed by crawler backstage 11 and stored in the database, so as to obtain an editable text; to upload the editable text to the database; and to provide a policies and regulations query interface on which the editable text is searched and displayed.
The policies and regulations query interface is shown in Fig. 7. As shown in Fig. 7, it includes the following content:
Query conditions: file name, issuing unit, file category, region level, and whether valid.
After any of the query conditions are filled in, clicking the "Query" button displays the query result; the file name is matched fuzzily.
Policies and regulations list:
Default state: all entered policies and regulations are displayed as entries;
Entry attributes: file name, file category, region level, issuing unit, time of entry into the system, and operation buttons (1. Modify: on click, jump to the new/modify page; 2. Delete: on click, a pop-up asks for confirmation of the deletion);
Sorting: by category first, then by entry time;
Clicking the "New" button jumps to the new/modify page. The new or modify page is shown in Fig. 8. As shown in Fig. 8, the attributes on the new page or modify page are: file name, validity date, original text link, file category, region level, and issuing unit. The required attributes are file name, file category, region level, and issuing unit; the optional attributes are validity date and original text link.
If a required attribute is not filled in, clicking "Save" brings up a pop-up prompt: the information is incomplete!
The new page or modify page includes a toolbar, as follows:
A. a text editing toolbar;
B. a text input box;
C. attachment adding: clicking "Attachment" brings up a pop-up for selecting and uploading a local file.
Clicking the "Save" button saves the file and automatically records the "latest modification time (date, hour, minute, second)" as a text attribute, which is displayed on the policies and regulations summary page as the entry attribute "entry time (date only)".
Delete
On the policies and regulations summary page, clicking the entry operation button "Delete" brings up a pop-up asking for confirmation of the deletion.
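The required-field check performed on "Save" can be sketched as follows. The field keys and the function name `missing_required_fields` are hypothetical; only the split into required and optional attributes comes from the description of the new/modify page.

```python
# Required and optional attributes of the new/modify page (keys are hypothetical).
REQUIRED_FIELDS = ("file_name", "file_category", "region_level", "issuing_unit")
OPTIONAL_FIELDS = ("validity_date", "original_link")


def missing_required_fields(form: dict[str, str]) -> list[str]:
    """Return the required fields that are absent or blank, in declaration order."""
    return [name for name in REQUIRED_FIELDS if not form.get(name, "").strip()]
```

If the returned list is non-empty, the page would show the incomplete-information prompt instead of saving the file.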
In this embodiment, the data tables that policies and regulations management subsystem 133 needs to maintain may include the following:
CREATE TABLE `Policy` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `fileName` varchar(500) NOT NULL DEFAULT '' COMMENT 'file name',
  `fileCategoryId` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'file category id',
  `departmentId` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'issuing unit id',
  `regionId` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'region level id',
  `validStartTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'validity start time',
  `validEndTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'validity end time',
  `originalUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'original text link',
  `cdnUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'cdn file link',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='information entry data table';
CREATE TABLE `PolicyDepartment` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `department` varchar(100) NOT NULL DEFAULT '' COMMENT 'issuing unit',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='issuing unit management table';
CREATE TABLE `PolicyRegion` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `region` varchar(50) NOT NULL DEFAULT '' COMMENT 'region level',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='region level management table';
CREATE TABLE `PolicyFileCategory` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `fileCategory` varchar(100) NOT NULL DEFAULT '' COMMENT 'file category',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0: normal, 1: deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='file category management table';
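Every table above carries an `isDeleted` flag, which suggests that the "Delete" button marks a row as deleted rather than removing it. The source does not state this, so the following Python sketch is only one plausible reading. It uses the built-in `sqlite3` module instead of MySQL, and the function name `soft_delete_policy` is hypothetical.

```python
import sqlite3
import time
from typing import Optional


def soft_delete_policy(
    conn: sqlite3.Connection, policy_id: int, now: Optional[int] = None
) -> bool:
    """Mark a Policy row as deleted; return True if a live row was changed."""
    timestamp = int(time.time()) if now is None else now
    cursor = conn.execute(
        "UPDATE Policy SET isDeleted = 1, updated = ? "
        "WHERE id = ? AND isDeleted = 0",
        (timestamp, policy_id),
    )
    return cursor.rowcount == 1
```

List and query pages would then filter on `isDeleted = 0`, so a deleted policy disappears from the interface while its record stays in the database.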
In another embodiment of the present application, another management backstage 13 is provided. Referring to Fig. 9, on the basis of management backstage 13 shown in Fig. 6, it may further include:
Risk report management subsystem 134, configured to enter risk reports and to provide a risk report search and display interface, so as to receive a risk report search request on that interface, respond to the request by querying the database for risk reports, and display them. A risk report is a report that assesses the risk to the business on the basis of the policies and regulations.
The risk report search and display interface is shown in Fig. 10. As shown in Fig. 10, it includes:
Query conditions: file name, corresponding policies and regulations, and risk level;
If none of the query conditions is filled in, the list is displayed in its default state;
If any of the query conditions are filled in, clicking "Query" displays the query result; the file name is matched fuzzily.
Policies and regulations list:
Default state: all entered risk reports are displayed as entries;
Entry attributes: file name, corresponding policies and regulations, risk level, time of entry into the system, and operation buttons (1. Modify: on click, jump to the new/modify page; 2. Delete: on click, a pop-up asks for confirmation of the deletion);
Sorting: by the attribute "corresponding policies and regulations" first, then by entry time.
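The query and sorting rules of this list can be sketched in Python. The function name `list_risk_reports` is hypothetical, the records are assumed to be dictionaries keyed by the `RiskReport` column names (`fileName`, `policyId`, `riskLevel`, `created`), fuzzy matching is assumed to be a case-insensitive substring test, and ascending order is assumed for both sort keys because the source does not state a direction.

```python
from typing import Optional


def list_risk_reports(
    reports: list[dict],
    file_name: str = "",
    policy_id: Optional[int] = None,
    risk_level: Optional[int] = None,
) -> list[dict]:
    """Filter by the optional query conditions, then sort for display."""
    rows = [
        r
        for r in reports
        if file_name.lower() in r["fileName"].lower()      # fuzzy file name match
        and (policy_id is None or r["policyId"] == policy_id)
        and (risk_level is None or r["riskLevel"] == risk_level)
    ]
    # Primary key: corresponding policy; secondary key: entry time.
    return sorted(rows, key=lambda r: (r["policyId"], r["created"]))
```

Calling the function with no conditions returns every report in the sorted order, which corresponds to the default state of the list.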
Clicking the "New" button jumps to the new/modify page. The new or modify page is shown in Fig. 11. As shown in Fig. 11, the new or modify page includes the attributes: file name, corresponding policies and regulations, and risk level. All three attributes are required.
If a required attribute is not filled in, clicking "Save" brings up a pop-up prompt: the information is incomplete!
The new or modify page further includes a toolbar:
a text editing toolbar;
a text input box;
attachment adding: clicking "Attachment" brings up a pop-up for selecting and uploading a local file;
Clicking the "Save" button saves the file and automatically records the "latest modification time (date, hour, minute, second)" as a text attribute, which is displayed on the risk report summary page as the entry attribute "entry time (date only)".
Delete
On the risk report summary page, clicking the entry operation button "Delete" brings up a pop-up asking for confirmation of the deletion.
It may include the following contents that risk report management subsystem, which needs the tables of data safeguarded:
CREATE TABLE `RiskReport` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `fileName` varchar(500) NOT NULL DEFAULT '' COMMENT 'file name',
  `policyId` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'corresponding policies and regulations id',
  `riskLevel` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'risk level (0-none 1-low 2-medium 3-high)',
  `cdnUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'cdn file link',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0-normal 1-deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='risk report data table';
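How the summary page might use this table can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module and a reduced copy of the `RiskReport` columns instead of MySQL, so the types differ from the definition above. The helper names are hypothetical, and ascending sort order is an assumption, since the text only states that sorting is by corresponding policies and regulations first and by entry time second.

```python
import sqlite3


def open_report_db():
    """Create an in-memory database with a reduced RiskReport table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE RiskReport (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               fileName TEXT NOT NULL DEFAULT '',
               policyId INTEGER NOT NULL DEFAULT 0,
               riskLevel INTEGER NOT NULL DEFAULT 0,
               created INTEGER NOT NULL DEFAULT 0,
               isDeleted INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def add_report(conn, file_name, policy_id, risk_level, created):
    """Insert one report and return its id."""
    cur = conn.execute(
        "INSERT INTO RiskReport (fileName, policyId, riskLevel, created) "
        "VALUES (?, ?, ?, ?)",
        (file_name, policy_id, risk_level, created),
    )
    return cur.lastrowid


def soft_delete(conn, report_id):
    """Mark a report as deleted (isDeleted = 1) instead of removing the row."""
    conn.execute("UPDATE RiskReport SET isDeleted = 1 WHERE id = ?", (report_id,))


def list_reports(conn):
    """List file names of non-deleted reports, by policy first, then entry time."""
    rows = conn.execute(
        "SELECT fileName FROM RiskReport WHERE isDeleted = 0 "
        "ORDER BY policyId, created"
    )
    return [row[0] for row in rows]
```

Keeping deleted rows with `isDeleted = 1` matches the flag in the table definition and leaves the record available for later auditing.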
In another embodiment of the application, another management backstage 13 is provided. Referring to Figure 12, on the basis of the management backstage 13 shown in Figure 9, it may further include:
A risk entry management subsystem 135, used for entering risk entries and providing a risk entry search and display interface, so as to receive a risk entry search request at the risk entry search and display interface and, in response to the request, search the database for risk entries and display them. A risk entry is an entry with business risk that is extracted, based on the policies and regulations, from the webpages crawled by the crawler backstage 11.
The initial interface for entering risk entries may refer to Figure 13. As shown in Figure 13, clicking "Delete" brings up a pop-up asking whether to confirm the deletion; clicking "Add" directly adds a blank row in which the relevant entry can be entered; clicking "Save" saves the added or modified content; clicking the content allows the content to be maintained. The content maintenance interface may refer to Figure 14. As shown in Figure 14, clicking "Delete" brings up a pop-up asking whether to confirm the deletion. Clicking "Add" directly adds a blank row in which the relevant entry can be entered. Clicking "Save" saves the added or modified content. Clicking "Edit case" allows the case to be edited. The case editing interface may refer to Figure 15.
The risk entry search and display interface may refer to Figure 16. Clicking the query button in Figure 16 queries a specific risk entry. Clicking the specific content of a risk entry in Figure 16 shows the content details, as shown in Figure 17. Clicking a case shows the case content, as shown in Figure 18.
The data tables that the risk entry management subsystem needs to maintain may include the following:
CREATE TABLE `RiskWordCategory` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(20) NOT NULL DEFAULT '' COMMENT 'category name',
  `description` varchar(200) NOT NULL DEFAULT '0' COMMENT 'category description',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0-normal 1-deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='risk dictionary category table';
CREATE TABLE `RiskWordProperty` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(20) NOT NULL DEFAULT '' COMMENT 'property name',
  `description` varchar(200) NOT NULL DEFAULT '0' COMMENT 'property description',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0-normal 1-deleted)',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='risk dictionary property table';
CREATE TABLE `RiskWord` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `word` varchar(200) NOT NULL DEFAULT '0' COMMENT 'word name',
  `categoryId` int(11) NOT NULL DEFAULT '0' COMMENT 'dictionary category id',
  `propertyId` int(11) NOT NULL DEFAULT '0' COMMENT 'dictionary property id',
  `subCategory` varchar(200) NOT NULL DEFAULT '0' COMMENT 'subcategory',
  `basis` varchar(200) NOT NULL DEFAULT '0' COMMENT 'basis',
  `ruleValidStartTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'regulation effective time',
  `ruleValidEndTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'regulation expiry time',
  `validStartTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'entry effective time',
  `validEndTime` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'entry expiry time',
  `expireReason` varchar(500) NOT NULL DEFAULT '0' COMMENT 'expiry reason',
  `cdnUrl` varchar(1000) NOT NULL DEFAULT '' COMMENT 'cdn case file link',
  `extra` varchar(1000) NOT NULL DEFAULT '' COMMENT 'additional information',
  `admin` bigint(20) unsigned NOT NULL DEFAULT '0' COMMENT 'operator',
  `created` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'creation time',
  `updated` int(11) unsigned NOT NULL DEFAULT '0' COMMENT 'update time',
  `isDeleted` tinyint(2) unsigned NOT NULL DEFAULT '0' COMMENT 'whether deleted (0-normal 1-deleted)',
  PRIMARY KEY (`id`),
  KEY `idx_categoryId` (`categoryId`),
  KEY `idx_propertyId` (`propertyId`),
  KEY `idx_subCategory` (`subCategory`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8 COMMENT='risk dictionary word table';
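A risk entry search against the `RiskWord` table could look roughly like the sketch below. It again uses Python's built-in `sqlite3` module with a reduced set of columns rather than MySQL. The function names are hypothetical, and two behaviours are assumptions not stated in the application: a `validEndTime` of 0 is treated as "never expires", and matching is a simple substring match on `word`.

```python
import sqlite3


def open_entry_db():
    """Create an in-memory database with a reduced RiskWord table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE RiskWord (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               word TEXT NOT NULL DEFAULT '',
               categoryId INTEGER NOT NULL DEFAULT 0,
               validStartTime INTEGER NOT NULL DEFAULT 0,
               validEndTime INTEGER NOT NULL DEFAULT 0,
               isDeleted INTEGER NOT NULL DEFAULT 0
           )"""
    )
    return conn


def search_entries(conn, keyword, now, category_id=None):
    """Return words that contain the keyword and are in force at time `now`."""
    sql = (
        "SELECT word FROM RiskWord "
        "WHERE isDeleted = 0 AND word LIKE ? "
        "AND validStartTime <= ? "
        "AND (validEndTime = 0 OR validEndTime > ?)"  # assumption: 0 = no expiry
    )
    params = ["%" + keyword + "%", now, now]
    if category_id is not None:
        # Optional filter; categoryId is indexed in the original definition.
        sql += " AND categoryId = ?"
        params.append(category_id)
    sql += " ORDER BY word"
    return [row[0] for row in conn.execute(sql, params)]
```

The optional category filter corresponds to the `idx_categoryId` index in the table definition, which suggests that lookups by category are expected to be common.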
In another embodiment of the application, another management backstage 13 is provided. Referring to Figure 19, on the basis of the management backstage 13 shown in Figure 12, it may further include:
A crawler result report display subsystem 136, used for displaying the classification parsing results stored in the database and the webpages crawled by the crawler backstage 11.
The interface for displaying the classification parsing results stored in the database and the webpages crawled by the crawler backstage 11 may refer to Figure 20. As shown in Figure 20, clicking a report shows its details; the details display interface may refer to Figure 21.
Next, the crawler method provided by the present application is introduced. The crawler method introduced below and the crawler system described above may be referred to in correspondence with each other.
The crawler method may comprise the following steps:
A11, crawling webpages from the network according to a preset crawl target;
A12, performing classification parsing on the webpages crawled by the crawler backstage to obtain classification parsing results, and storing the classification parsing results and the webpages crawled by the crawler backstage in a database;
A13, managing the classification parsing results stored in the database and the webpages crawled by the crawler backstage.
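Steps A11 and A12 can be sketched in a few lines of Python. The crawl loop follows the outline given in claim 2 (select relevant URLs, put them in a to-be-crawled queue, pick the next URL by a search strategy, stop when a set condition is reached). Everything concrete here is an assumption made for illustration: the function names, the use of a FIFO queue as a breadth-first search strategy, a page-count limit as the stop condition, keyword rules as the "classification parsing", and a plain dictionary standing in for the database. Fetching and link extraction are passed in as functions so that the sketch does not depend on any particular network library.

```python
from collections import deque


def crawl(seed_urls, fetch, extract_links, is_relevant, max_pages):
    """A11: crawl pages reachable from the seeds that match the crawl target."""
    queue = deque(u for u in seed_urls if is_relevant(u))  # to-be-crawled URL queue
    seen = set(queue)
    pages = {}
    while queue and len(pages) < max_pages:  # assumed stop condition
        url = queue.popleft()  # FIFO order, i.e. breadth-first
        html = fetch(url)
        if html is None:  # fetch failed, skip this URL
            continue
        pages[url] = html
        for link in extract_links(html):
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return pages


def classify(html, rules):
    """A12: return the first category whose keyword occurs in the page."""
    for category, keyword in rules:
        if keyword in html:
            return category
    return "other"


def run_pipeline(seed_urls, fetch, extract_links, is_relevant, rules,
                 max_pages, database):
    """Crawl, classify, and store both the result and the page itself."""
    pages = crawl(seed_urls, fetch, extract_links, is_relevant, max_pages)
    for url, html in pages.items():
        database[url] = {"category": classify(html, rules), "html": html}
    return database
```

Step A13 then operates on whatever `run_pipeline` has stored; in the described system that role is played by the subsystems of the management backstage.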
It should be noted that each embodiment focuses on its differences from the other embodiments; for the parts that are the same or similar among the embodiments, the embodiments may be referred to one another. Since the apparatus embodiments are basically similar to the method embodiments, they are described relatively simply, and for relevant details reference may be made to the corresponding description of the method embodiments.
Finally, it should be noted that, herein, relational terms such as first and second are used merely to distinguish one entity or operation from another entity or operation, without necessarily requiring or implying any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or further includes elements inherent to such a process, method, article or device. In the absence of further restrictions, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article or device that includes the element.
For convenience of description, the above apparatus is described as being divided into various units by function. Of course, when implementing the application, the functions of the units may be realized in one or more pieces of software and/or hardware.
As can be seen from the above description of the embodiments, those skilled in the art can clearly understand that the application can be implemented by means of software plus a necessary general hardware platform. Based on this understanding, the technical solution of the application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) to execute the methods described in the embodiments of the application or in certain parts of the embodiments.
A crawler system and method provided by the application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the application, and the description of the above embodiments is only intended to help understand the method of the application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementation and application scope according to the idea of the application. In conclusion, the contents of this specification should not be construed as limiting the application.
Claims (10)
1. A crawler system, characterized by comprising:
a crawler backstage, for crawling webpages from the network according to a preset crawl target;
a crawler report real-time processing system, for performing classification parsing on the webpages crawled by the crawler backstage to obtain classification parsing results, and storing the classification parsing results and the webpages crawled by the crawler backstage in a database;
a management backstage, for managing the classification parsing results stored in the database and the webpages crawled by the crawler backstage.
2. The crawler system according to claim 1, characterized in that the crawler backstage is specifically used for:
selecting, according to a web page analysis algorithm, the URLs relevant to the preset crawl target, and placing the URLs relevant to the preset crawl target into a to-be-crawled URL queue;
choosing a URL from the to-be-crawled URL queue according to a search strategy as a target URL, and crawling a webpage from the network according to the target URL;
judging whether a set crawl stop condition is reached;
if it is reached, ending the crawl;
if it is not reached, executing again the step of selecting, according to the web page analysis algorithm, the URLs relevant to the preset crawl target, until the set crawl stop condition is reached.
3. The crawler system according to claim 1, characterized in that the crawler system further comprises:
distributed message middleware, for passing the webpages crawled by the crawler backstage to the crawler report real-time processing system.
4. The crawler system according to claim 1, characterized in that the management backstage comprises:
a monitoring alarm subsystem, for monitoring whether the updating of the webpages crawled by the crawler backstage and stored in the database is abnormal;
if an abnormality exists, raising an abnormality alarm and notifying the alarm recipient.
5. The crawler system according to claim 4, characterized in that the management backstage further comprises:
a crawler channel and report management subsystem, for periodically retrieving, among the webpages crawled by the crawler backstage and stored in the database, the specified column content of specified websites to obtain a retrieval result, and, according to the retrieval result, feeding back to an administrator the websites whose column content has changed.
6. The crawler system according to claim 5, characterized in that the management backstage further comprises:
a policies and regulations management subsystem, for editing, according to a unified format, the policies and regulations selected from the webpages crawled by the crawler backstage and stored in the database to obtain editable text, uploading the editable text to the database, and providing a policies and regulations query interface, so that the editable text can be searched and displayed at the policies and regulations query interface.
7. The crawler system according to claim 6, characterized in that the management backstage further comprises:
a risk report management subsystem, for entering risk reports and providing a risk report search and display interface, so as to receive a risk report search request at the risk report search and display interface and, in response to the risk report search request, query the database for risk reports and display them, a risk report being a report of a risk assessment performed on a business based on the policies and regulations.
8. The crawler system according to claim 7, characterized in that the management backstage further comprises:
a risk entry management subsystem, for entering risk entries and providing a risk entry search and display interface, so as to receive a risk entry search request at the risk entry search and display interface and, in response to the risk entry search request, search the database for risk entries and display them, a risk entry being an entry with business risk that is extracted, based on the policies and regulations, from the webpages crawled by the crawler backstage.
9. The crawler system according to claim 8, characterized in that the management backstage further comprises:
a crawler result report display subsystem, for displaying the classification parsing results stored in the database and the webpages crawled by the crawler backstage.
10. A crawler method, characterized by comprising:
crawling webpages from the network according to a preset crawl target;
performing classification parsing on the webpages crawled by the crawler backstage to obtain classification parsing results, and storing the classification parsing results and the webpages crawled by the crawler backstage in a database;
managing the classification parsing results stored in the database and the webpages crawled by the crawler backstage.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910807818.7A CN110516135A (en) | 2019-08-29 | 2019-08-29 | A kind of crawler system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910807818.7A CN110516135A (en) | 2019-08-29 | 2019-08-29 | A kind of crawler system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110516135A true CN110516135A (en) | 2019-11-29 |
Family
ID=68627865
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910807818.7A Pending CN110516135A (en) | 2019-08-29 | 2019-08-29 | A kind of crawler system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110516135A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113094382A (en) * | 2021-04-02 | 2021-07-09 | 南开大学 | Semi-automatic data acquisition and updating method for multi-source data management |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102339292A (en) * | 2010-07-27 | 2012-02-01 | 中国电信股份有限公司 | Distributed searching method and system |
CN103488750A (en) * | 2013-09-24 | 2014-01-01 | 长沙裕邦软件开发有限公司 | Implementation method and system of network robot |
CN105608134A (en) * | 2015-12-18 | 2016-05-25 | 盐城工学院 | Multithreading-based web crawler system and web crawling method thereof |
CN109063144A (en) * | 2018-08-07 | 2018-12-21 | 广州金猫信息技术服务有限公司 | Visual network crawler method and device |
-
2019
- 2019-08-29 CN CN201910807818.7A patent/CN110516135A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20191129 |