CN109409091A - Method, apparatus, device, and computer storage medium for detecting Web pages - Google Patents
- Publication number
- CN109409091A (application CN201811142835.5A)
- Authority
- CN
- China
- Prior art keywords
- web page
- data
- label
- sensitive
- keyword
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
- G06F21/577—Assessing vulnerabilities and evaluating computer system security
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/60—Protecting data
- G06F21/62—Protecting access to data via a platform, e.g. using keys or access control rules
- G06F21/6218—Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
- G06F21/6245—Protecting personal data, e.g. for financial or medical purposes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/242—Dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/02—Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/03—Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
- G06F2221/034—Test or assess a computer or a system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2221/00—Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/21—Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F2221/2119—Authenticating web pages, e.g. with suspicious links
Abstract
The invention discloses a method, apparatus, device, and computer-readable storage medium for detecting Web pages. The method comprises: identifying the data information of a Web page to be detected and calculating the data degree of the Web page; judging whether the data degree is greater than or equal to a preset data degree threshold; and, if the data degree is less than the preset data degree threshold, determining that the Web page is not a sensitive data page. The method, apparatus, device, and computer-readable storage medium provided by the invention can improve the efficiency of Web page detection and the credibility of the detection result.
Description
Technical field
The present invention relates to the technical field of network security, and in particular to a method, apparatus, device, and computer-readable storage medium for detecting Web pages.
Background
With the development of the Internet, the world is becoming increasingly digitized, and the accompanying data security problems are becoming increasingly serious. Data leaks involve personal information, corporate information, and state secrets. Hackers exploit server vulnerabilities to obtain large volumes of sensitive data, which they use themselves or channel into the underground economy, threatening personal property or national security.
In the prior art, sensitive data is identified mainly by keyword dictionary matching combined with manual confirmation. Keyword dictionary matching checks network traffic against user-defined keywords: if network data matches the keyword dictionary, that data is considered sensitive.
When keyword dictionary matching is used to identify sensitive data, the recognition quality depends heavily on how complete the dictionary is, and the recognition process is easily disturbed by other data, which causes misjudgments. The accuracy of sensitive-data identification by keyword dictionary matching is therefore too low.
When sensitive data in network data is identified manually, recognition is slow, a large amount of manpower is required, and staff need a certain level of professional competence. Moreover, each staff member's judgment of sensitive data is subjective, and judgment criteria differ between staff members, so identification results for the same data vary.
In summary, how to improve the efficiency of Web page data detection is a problem that currently needs to be solved.
Summary of the invention
The object of the present invention is to provide a method, apparatus, device, and computer-readable storage medium for detecting Web pages, so as to solve the problem that the prior art identifies sensitive data in Web pages inefficiently.
To solve the above technical problem, the present invention provides a method for detecting a Web page, comprising: identifying the data information of a Web page to be detected and calculating the data degree of the Web page; judging whether the data degree is greater than or equal to a preset data degree threshold; and, if the data degree is less than the preset data degree threshold, determining that the Web page is not a sensitive data page.
Preferably, judging whether the data degree is greater than or equal to the preset data degree threshold further comprises:
if the data degree is greater than or equal to the preset data degree threshold, searching for sensitive keywords in the Web page data according to a pre-built sensitive-keyword dictionary;
calculating the sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
judging whether the sensitivity is greater than or equal to a preset sensitivity threshold;
if the sensitivity is greater than or equal to the preset sensitivity threshold, determining that the Web page is a sensitive data page.
Preferably, before searching for sensitive keywords in the Web page data according to the pre-built sensitive-keyword dictionary, the method further comprises:
collecting sensitive keywords from various industries into a sensitive data set, and assigning a weight to each sensitive keyword in the sensitive data set according to word frequency and semantics;
classifying the sensitive keywords whose weight is greater than a preset weight threshold, building the sensitive-keyword dictionary, and setting a semantic scene for each sensitive keyword in the sensitive-keyword dictionary.
Preferably, searching for sensitive keywords in the Web page data according to the pre-built sensitive-keyword dictionary comprises:
splitting the Web page data into multiple data strings, and performing keyword matching on each data string according to the pre-built sensitive-keyword dictionary;
performing semantic-scene filtering on the data strings that successfully match a sensitive keyword in the sensitive-keyword dictionary;
filtering out the successfully matched data strings that do not fit the semantic scene, and taking the data strings remaining after filtering as the sensitive keywords in the Web page data.
Preferably, filtering out the successfully matched data strings that do not fit the semantic scene and taking the data strings remaining after filtering as the sensitive keywords in the Web page data comprises:
analyzing the position information, length information, and symbol features of the successfully matched data strings, and judging whether each successfully matched data string fits the semantic scene;
deleting the successfully matched data strings that do not fit the semantic scene, and taking the remaining data strings as the sensitive keywords in the Web page data.
Preferably, identifying the data information of the Web page to be detected and calculating the data degree of the Web page comprises:
calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationships among the table tags.
Preferably, calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationships among the table tags comprises:
counting the data volume of the Web page by using the nesting relationships among the table, thead, tbody, tr, th, and td tags of the Web page, so as to calculate the data degree of the Web page.
Preferably, counting the data volume of the Web page by using the nesting relationships among the table, thead, tbody, tr, th, and td tags of the Web page comprises:
restoring the distribution of data entities, interference items, and outliers in the Web page by using the nesting relationships among the table, thead, tbody, tr, th, and td tags of the Web page;
calculating the data volume of each row of the Web page table, and then computing the data volume of the whole Web page table from the row totals.
Preferably, before identifying the data information of the Web page to be detected and calculating the data degree of the Web page, the method further comprises:
judging, from the web address and page features of the Web page to be detected, the probability that the Web page contains sensitive data.
Preferably, judging, from the web address and page features of the Web page to be detected, the probability that the Web page contains sensitive data comprises:
if the Web page is a static page or contains search-engine tag features, determining that the probability that the Web page contains sensitive data is zero.
The present invention also provides an apparatus for detecting a Web page, comprising:
an identification module, configured to identify the data information of a Web page to be detected and calculate the data degree of the Web page;
a judgment module, configured to judge whether the data degree is greater than or equal to a preset data degree threshold;
a determination module, configured to determine, if the data degree is less than the preset data degree threshold, that the Web page to be detected is not a sensitive data page.
The present invention also provides a device for detecting a Web page, comprising:
a memory, configured to store a computer program; and a processor, configured to implement the steps of the above method for detecting a Web page when executing the computer program.
The present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above method for detecting a Web page.
The method for detecting a Web page provided by the present invention calculates the data degree of the Web page from the data information of the Web page to be detected. When the data degree of the Web page to be detected is small, the page holds little data, which greatly reduces the likelihood that it is a sensitive data page. The data degree is therefore judged against a preset data degree threshold: when the data degree of the Web page is less than the preset data degree threshold, the Web page is judged not to be a sensitive data page. The method introduces the data degree of a Web page as a measure for sensitive data pages and sets a preset data degree threshold as a uniform standard, so that Web pages that are not sensitive data pages can be detected quickly. The method thus addresses two problems of the prior art: detecting whether a Web page is a sensitive data page relies on a large number of professionally trained staff, which makes detection inefficient; and because each staff member's subjective criteria differ, detection results for the same Web page vary. The method improves both the efficiency of Web page detection and the credibility of the detection result.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a first specific embodiment of the method for detecting a Web page provided by the present invention;
Fig. 2 is a flowchart of a second specific embodiment of the method for detecting a Web page provided by the present invention;
Fig. 3 is a flowchart of a third specific embodiment of the method for detecting a Web page provided by the present invention;
Fig. 4 is a structural block diagram of an apparatus for detecting a Web page provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of a device for detecting a Web page provided by an embodiment of the present invention.
Detailed description
The core of the present invention is to provide a method, apparatus, device, and computer-readable storage medium for detecting Web pages that improve the efficiency of Web page detection.
To enable those skilled in the art to better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.
Referring to Fig. 1, which is a flowchart of a first specific embodiment of the method for detecting a Web page provided by the present invention, the specific steps are as follows:
Step S101: identify the data information of the Web page to be detected, and calculate the data degree of the Web page;
In this embodiment, the table tags of the Web page can be counted, the data distribution of the Web page can be restored from the nesting relationships among those tags, and the statistics can be matched against a defined data-page model to calculate the data degree of the Web page.
The table tags of the Web page include the table, thead, tbody, tr, th, and td tags. The data volume of the Web page can be counted from the nesting relationships among these tags. The nesting relationships include the positional relationships of these six tags, for example whether a thead or tbody tag is present inside the table tag. The thead and tbody tags denote the table header and the table body respectively, while the tr, th, and td tags describe the individual data entities.
The nesting relationships of table tags in this embodiment include common ones, such as the positions of the tr, th, and td tags inside the thead and tbody tags, and special ones, such as a td tag that contains a table tag, or a table that has only td tags and no th tags.
From these nesting relationships, the data distribution of the Web page can be restored, identifying which parts of the Web page data are data entities, which are interference items, and which are outliers. The data volume of each row is calculated with one table row as the unit, and the data volume of the whole table is then computed from the row totals.
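As a non-limiting illustration, the row-by-row counting described above might be sketched in Python as follows. The names `TableCellCounter` and `data_degree`, and the choice to count only non-empty td cells as data entities while skipping th header cells, are assumptions introduced here; the patent does not define the data-page model at this level of detail, and the special case of a td tag containing a nested table is not handled.

```python
from html.parser import HTMLParser


class TableCellCounter(HTMLParser):
    """Counts non-empty <td> cells per <tr>; <th> header cells are ignored."""

    def __init__(self):
        super().__init__()
        self.row_counts = []   # number of non-empty data cells in each row
        self._in_td = False
        self._cell_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row_counts.append(0)      # start a new row total
        elif tag == "td":
            self._in_td = True
            self._cell_text = []

    def handle_data(self, data):
        if self._in_td:
            self._cell_text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._in_td:
            self._in_td = False
            # Assumption: an empty cell is an interference item, not a data entity.
            if "".join(self._cell_text).strip() and self.row_counts:
                self.row_counts[-1] += 1


def data_degree(html):
    """Hypothetical data degree: total of the per-row data-cell counts."""
    parser = TableCellCounter()
    parser.feed(html)
    parser.close()
    return sum(parser.row_counts)


sample = (
    "<table><thead><tr><th>Name</th><th>Amount</th></tr></thead>"
    "<tbody><tr><td>A</td><td>100</td></tr>"
    "<tr><td>B</td><td></td></tr></tbody></table>"
)
degree = data_degree(sample)
```

Here the header row contributes nothing, the first body row contributes two cells, and the second contributes one, so the sketch reports a data degree of 3 for the sample table.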
It should be noted that the above data-page model, which restores the data distribution, is suited to calculating the data degree of table-type data. Taking a sensitive data page of a financial reimbursement system as an example, the sensitivity relates to information such as "amount", "phone", and "order number", while the data degree corresponds to the data values in each list.
Step S102: judge whether the data degree is greater than or equal to the preset data degree threshold;
Step S103: if the data degree is less than the preset data degree threshold, determine that the Web page is not a sensitive data page.
In this embodiment, the nesting relationships among the table tags of the Web page are used to restore the distribution of data entities, irrelevant items, and interference items in the Web page and to identify the page's overall data profile. The data degree then serves as a basis for sensitivity detection, which saves considerable human resources and improves the efficiency of detecting the Web page. A preset data degree threshold is also set as a uniform standard, which improves the credibility of the detection result.
Building on the above embodiment, in the following embodiment, if the data degree of the Web page is greater than or equal to the preset data degree threshold, the sensitivity of the Web page is calculated and judged to determine whether the Web page is a sensitive data page.
Referring to Fig. 2, which is a flowchart of a second specific embodiment of the method for detecting a Web page provided by the present invention, the specific steps are as follows:
Step S201: calculate the data degree of the Web page to be detected according to its table tags and the nesting relationships among the table tags;
Step S202: judge whether the data degree is greater than or equal to the preset data degree threshold;
Step S203: if the data degree is greater than or equal to the preset data degree threshold, search for sensitive keywords in the Web page data according to the pre-built sensitive-keyword dictionary;
Before the sensitive keywords in the Web page data are searched for according to the pre-built sensitive-keyword dictionary, the method further includes: collecting the sensitive keywords that occur in various industries into a sensitive data set, and screening and training on that set; assigning a weight to each sensitive keyword according to word frequency and semantics; and classifying the sensitive keywords whose weight is greater than a preset weight threshold to build the sensitive-keyword dictionary.
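A minimal Python sketch of the dictionary construction follows, under stated assumptions. The patent assigns weights by word frequency and semantics without giving a formula; here the weight is simply the keyword's frequency relative to the most frequent keyword, and the semantic component is omitted. The function name `build_dictionary`, the `(keyword, category)` input shape, and the sample data are hypothetical.

```python
from collections import Counter, defaultdict


def build_dictionary(samples, weight_threshold):
    """Build {category: {keyword: weight}} from (keyword, category) pairs.

    Assumption: weight = frequency / highest frequency. The source also
    mentions semantics, which this sketch does not model.
    """
    samples = list(samples)
    freq = Counter(keyword for keyword, _ in samples)
    if not freq:
        return {}
    top = max(freq.values())
    category_of = dict(samples)            # last category seen for a keyword wins
    library = defaultdict(dict)
    for keyword, count in freq.items():
        weight = count / top
        if weight > weight_threshold:      # keep only keywords above the weight threshold
            library[category_of[keyword]][keyword] = weight
    return dict(library)


collected = [
    ("amount", "finance"), ("amount", "finance"), ("amount", "finance"),
    ("phone", "contact"), ("phone", "contact"),
    ("memo", "misc"),
]
library = build_dictionary(collected, weight_threshold=0.5)
```

With the sample data, "amount" and "phone" pass the 0.5 threshold and are grouped under their categories, while "memo" is dropped.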
When sensitive data in the Web page data to be detected is searched for according to the sensitive-keyword dictionary, the Web page data can first be split into multiple data strings, and each data string is matched against the sensitive keywords in the sensitive-keyword dictionary. Semantic-scene filtering is then applied to the data strings that match: based on the position information, length information, and symbol features of each matched data string, it is judged whether the data string fits the semantic scene of the corresponding sensitive keyword in the dictionary. The data strings that do not fit the semantic scene are deleted, and the remaining data strings are taken as the sensitive keywords of the Web page.
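The split, match, and filter sequence above might be sketched in Python as follows. The `SCENES` table, its regular expressions, the delimiter set used for splitting, and the function name `find_sensitive_strings` are all assumptions made for illustration; English keywords stand in for the Chinese ones, and the patent does not specify how a semantic scene is encoded. Each pattern checks position (keyword at the start), symbol features (a colon), and length (a short value).

```python
import re

# Hypothetical semantic scenes: keyword first, then a colon, then a short value.
SCENES = {
    "amount": re.compile(r"^amount\s*[:：]\s*\d+(\.\d+)?\s*\w{0,5}$"),
    "phone": re.compile(r"^phone\s*[:：]\s*\+?\d[\d\- ]{6,14}$"),
}


def find_sensitive_strings(page_text):
    """Return (all data strings, data strings that match and fit their scene)."""
    # 1. Split the page data into data strings (assumed delimiters).
    strings = [s.strip() for s in re.split(r"[\n;|]+", page_text) if s.strip()]
    # 2. Keyword matching against the dictionary.
    matched = [(kw, s) for s in strings for kw in SCENES if kw in s.lower()]
    # 3. Semantic-scene filtering of the matched strings.
    kept = [s for kw, s in matched if SCENES[kw].match(s.lower())]
    return strings, kept


text = (
    "amount: 1234 yuan\n"
    "what is the amount payable\n"
    "phone: 138-0000-0000\n"
    "hello"
)
all_strings, sensitive = find_sensitive_strings(text)
```

In this sample, "what is the amount payable" matches the keyword "amount" but is discarded by the scene filter, so only the two labeled values remain.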
Step S204: calculate the sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
Step S205: judge whether the sensitivity is greater than or equal to the preset sensitivity threshold;
Step S206: if the sensitivity is greater than or equal to the preset sensitivity threshold, determine that the Web page is a sensitive data page.
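Steps S202 to S206 can be summarized in a short Python sketch. Measuring the proportion as the number of sensitive data strings divided by the total number of data strings is an assumption, as are the function names and the "undetermined" result: the embodiment does not state the outcome when the sensitivity falls below its threshold.

```python
def sensitivity(sensitive_strings, all_strings):
    """Assumed measure: share of data strings that are sensitive keywords."""
    if not all_strings:
        return 0.0
    return len(sensitive_strings) / len(all_strings)


def classify(data_degree, sens, degree_threshold, sens_threshold):
    """Decision flow of the second embodiment (steps S202 to S206)."""
    if data_degree < degree_threshold:
        return "not sensitive"       # too little data on the page
    if sens >= sens_threshold:
        return "sensitive"           # step S206
    return "undetermined"            # branch not specified in the source


ratio = sensitivity(["amount: 1234 yuan"], ["a", "b", "c", "amount: 1234 yuan"])
result = classify(data_degree=10, sens=ratio, degree_threshold=5, sens_threshold=0.2)
```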
In the method provided by this embodiment, when the data degree of the Web page to be detected is greater than or equal to the preset data degree threshold, the sensitivity of the Web page is judged. The Web page data is split into multiple data strings, which are matched against the pre-built sensitive-keyword dictionary. Semantic-scene filtering is applied to the data strings that match, and the data strings that fit the semantic scene of the corresponding sensitive keyword in the dictionary are taken as the sensitive keywords of the Web page, from which the sensitivity of the Web page is calculated. Compared with the prior art, which judges whether a Web page is sensitive by keyword dictionary matching alone, this embodiment finds the sensitive keywords of the Web page more accurately and therefore yields a more accurate sensitivity. On that basis, this embodiment uses both the sensitivity and the data degree to detect whether the Web page is a sensitive data page, which improves the accuracy of the detection result.
Building on the above embodiments, in this embodiment the sensitivity of the Web page to be detected can be judged first; when the sensitivity is greater than or equal to the preset sensitivity threshold, the data degree of the Web page is then judged to determine whether the Web page is a sensitive data page. In addition, before the sensitivity or data degree of the Web page is calculated, pages that obviously contain no sensitive data can be excluded according to the web address and page features of the Web page to be detected, which improves detection efficiency and accuracy.
Referring to Fig. 3, which is a flowchart of a third specific embodiment of the method for detecting a Web page provided by the present invention, the specific steps are as follows:
Step S301: judge whether the probability that the Web page to be detected contains sensitive data is zero;
In this embodiment, the web address and page features can be used to pre-filter the Web page to be detected and exclude Web pages that obviously contain no sensitive data, which improves detection efficiency and accuracy. For example, if the Web page is a static page or contains search-engine tag features, it cannot be a sensitive data page.
When the web address of the Web page points to certain static pages, such as HTML pages, the probability that the Web page contains sensitive data is zero, so sensitive-data detection is not needed for that Web page.
When the Web page contains certain search-engine tag features, such as keyword, sensitive-data detection is not performed on the Web page.
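A possible Python sketch of this pre-filter is given below. Treating a URL path ending in `.html` or `.htm` as a static page, and treating a `<meta name="keywords">` element as the search-engine tag feature, are interpretations assumed here; the source names only "static pages, such as HTML" and "keyword". The function name `probability_is_zero` is hypothetical.

```python
import re
from urllib.parse import urlparse

# Assumption: these suffixes identify static pages.
STATIC_SUFFIXES = (".html", ".htm")
# Assumption: a <meta name="keywords"> element is the search-engine tag feature.
META_KEYWORDS = re.compile(
    r"<meta\b[^>]*\bname\s*=\s*[\"']?keywords\b", re.IGNORECASE
)


def probability_is_zero(url, html):
    """True if the page can be skipped before any sensitive-data detection."""
    path = urlparse(url).path.lower()
    if path.endswith(STATIC_SUFFIXES):
        return True
    return META_KEYWORDS.search(html) is not None


static_page = probability_is_zero("https://example.com/about.html", "")
seo_page = probability_is_zero(
    "https://example.com/report?id=3",
    "<html><head><meta name='keywords' content='news'></head></html>",
)
data_page = probability_is_zero(
    "https://example.com/report?id=3", "<table><tr><td>1</td></tr></table>"
)
```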
Step S302: if the probability that the Web page contains sensitive data is not zero, search for sensitive keywords in the Web page data to be detected according to the pre-built sensitive-keyword dictionary;
Step S303: filter out the sensitive keywords that do not fit the semantic scene, and take the sensitive keywords remaining after filtering as the sensitive keywords obtained from the Web page;
After the operations of matching-source extraction, keyword pattern matching, and semantic-scene analysis have been performed on the Web page data, the sensitivity of the Web page is calculated.
Matching-source extraction corresponds to the step in the above embodiment of splitting the Web page data into multiple data strings, each of which is a matching source. Keyword pattern matching means performing keyword matching on the data strings according to the sensitive-keyword dictionary; semantic-scene analysis is then applied to the data strings that match, which yields the sensitive keywords of the Web page.
Semantic-scene filtering of the matched data strings in the Web page analyzes whether each string fits Chinese semantics, based on the keyword position, the length of the matched data, special symbols, numeric features, and so on. For example, the extracted data string "amount: 1234 yuan" fits the semantic scene defined for the sensitive keyword "amount", whereas the data string "how much is the amount payable" does not.
Step S304: calculate the sensitivity of the Web page according to the proportion of the Web page's sensitive keywords in the Web page data;
Step S305: judge whether the sensitivity is greater than or equal to the preset sensitivity threshold;
Step S306: if the sensitivity is greater than or equal to the preset sensitivity threshold, calculate the data degree of the Web page according to the table tags of the Web page and the nesting relationships among the table tags;
Step S307: judge whether the data degree is greater than or equal to the preset data degree threshold;
Step S308: if the data degree is greater than or equal to the preset data degree threshold, determine that the Web page is a sensitive data page.
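The ordering of steps S301 to S308 can be illustrated with a small Python sketch in which each stage is passed in as a callable, so that later and costlier stages run only when earlier ones do not settle the result. The function name `detect`, its parameters, and the "undetermined" result for branches the embodiment leaves unstated are assumptions.

```python
def detect(page, prefilter, sensitivity_of, data_degree_of,
           sens_threshold, degree_threshold):
    """Third-embodiment flow: pre-filter, then sensitivity, then data degree."""
    if prefilter(page):                               # S301: probability is zero
        return "not sensitive"
    if sensitivity_of(page) < sens_threshold:         # S302 to S305
        return "undetermined"                         # branch not specified in the source
    if data_degree_of(page) >= degree_threshold:      # S306 to S308
        return "sensitive"
    return "undetermined"                             # branch not specified in the source


calls = []


def counted_degree(page):
    """Stand-in data-degree stage that records when it is invoked."""
    calls.append(page)
    return 10


skipped = detect("p1", lambda p: True, lambda p: 1.0, counted_degree, 0.3, 5)
low_sens = detect("p2", lambda p: False, lambda p: 0.1, counted_degree, 0.3, 5)
flagged = detect("p3", lambda p: False, lambda p: 0.5, counted_degree, 0.3, 5)
```

In the sample calls, the data-degree stage runs only for the third page, which is the point of placing the cheaper checks first.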
In this embodiment, the Web page to be detected is first pre-filtered according to page features such as its web address and page tags. After Web pages that obviously contain no sensitive data have been excluded, the Web page data is matched and its semantic scenes are analyzed according to the pre-built sensitive-keyword dictionary, and the sensitivity of the Web page is calculated. When the sensitivity is greater than or equal to the preset sensitivity threshold, the nesting relationships among the table tags are used to restore the distribution of data entities, irrelevant items, and interference items in the Web page and to identify the page's overall data profile, with the data volume serving as a basis for sensitivity detection. This embodiment judges whether the Web page is a sensitive data page from the matching information obtained by Chinese semantic analysis and the information obtained by data-page identification, that is, from the magnitudes and weights of the Web page's sensitivity and data degree. This greatly saves human resources and improves both the efficiency of Web page detection and the accuracy of the detection result.
Referring to Fig. 4, which is a structural block diagram of an apparatus for detecting a Web page provided by an embodiment of the present invention, the apparatus may specifically include:
an identification module 100, configured to identify the data information of a Web page to be detected and calculate the data degree of the Web page;
a judgment module 200, configured to judge whether the data degree is greater than or equal to a preset data degree threshold;
a determination module 300, configured to determine, if the data degree is less than the preset data degree threshold, that the Web page to be detected is not a sensitive data page.
The apparatus of this embodiment is used to implement the method for detecting a Web page described above, so its specific implementation can be found in the method embodiments above. For example, the identification module 100, the judgment module 200, and the determination module 300 implement steps S101, S102, and S103 of the method respectively. Their specific implementations can therefore be found in the descriptions of the corresponding embodiments and are not repeated here.
Referring to Fig. 5, which is a structural diagram of a device for detecting a Web page provided by an embodiment of the present invention, the device may specifically include:
a memory 100, configured to store a computer program;
a processor 200, configured to implement the steps of the above method for detecting a Web page when executing the computer program.
A specific embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above method for detecting a Web page.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms according to function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The method, device, equipment and computer readable storage medium for detecting a Web page provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. It should be noted that those of ordinary skill in the art may make several improvements and modifications to the present invention without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.
Claims (13)
1. A method for detecting a Web page, characterized by comprising:
identifying data information of a Web page to be detected, and calculating a data degree of the Web page;
judging whether the data degree is greater than or equal to a preset data degree threshold;
if the data degree is less than the preset data degree threshold, determining that the Web page is not a sensitive data page.
2. The method according to claim 1, characterized in that judging whether the data degree is greater than or equal to the preset data degree threshold further comprises:
if the data degree is greater than or equal to the preset data degree threshold, searching for sensitive keywords in the Web page data according to a pre-built sensitive keyword dictionary;
calculating a sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
judging whether the sensitivity is greater than or equal to a preset sensitivity threshold;
if the sensitivity is greater than or equal to the preset sensitivity threshold, determining that the Web page is a sensitive data page.
3. The method according to claim 2, characterized in that, before searching for the sensitive keywords in the Web page data according to the pre-built sensitive keyword dictionary, the method further comprises:
collecting sensitive keywords of various industries to form a sensitive data set, and assigning a weight to each sensitive keyword in the sensitive data set according to word frequency and semantics;
classifying the sensitive keywords whose weight is greater than a preset weight threshold, then building the sensitive keyword dictionary and setting a semantic scene for each sensitive keyword in the sensitive keyword dictionary.
4. The method according to claim 3, characterized in that searching for the sensitive keywords in the Web page data according to the pre-built sensitive keyword dictionary comprises:
splitting the Web page data to obtain multiple data strings, and performing keyword matching on each of the data strings according to the pre-built sensitive keyword dictionary;
performing semantic scene filtering on the data strings that successfully match sensitive keywords in the sensitive keyword dictionary;
filtering out, from the successfully matched data strings, the data strings that do not fit the semantic scene, and taking the data strings remaining after filtering as the sensitive keywords in the Web page data.
5. The method according to claim 4, characterized in that filtering out, from the successfully matched data strings, the data strings that do not fit the semantic scene, and taking the data strings remaining after filtering as the sensitive keywords in the Web page data comprises:
analyzing position information, length information and symbol features of the successfully matched data strings to judge whether each successfully matched data string fits the semantic scene;
deleting, from the successfully matched data strings, the data strings that do not fit the semantic scene, and taking the remaining data strings as the sensitive keywords in the Web page data.
6. The method according to claim 1, characterized in that identifying the data information of the Web page to be detected and calculating the data degree of the Web page comprises:
calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationship of the table tags.
7. The method according to claim 6, characterized in that calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationship of the table tags comprises:
counting the data volume of the Web page by using the nesting relationship among the table tag, thead tag, tbody tag, tr tag, th tag and td tag of the Web page, so as to calculate the data degree of the Web page.
8. The method according to claim 7, characterized in that counting the data volume of the Web page by using the nesting relationship among the table tag, thead tag, tbody tag, tr tag, th tag and td tag of the Web page comprises:
restoring the data distribution of data entities, interference items and outliers in the Web page by using the nesting relationship among the table tag, thead tag, tbody tag, tr tag, th tag and td tag of the Web page;
calculating the data volume of each row of the Web page table separately, and then computing the total data volume of the Web page table.
9. The method according to claim 1, characterized in that, before identifying the data information of the Web page to be detected and calculating the data degree of the Web page, the method further comprises:
judging the probability that sensitive data exists in the Web page by using the web page address and page features of the Web page to be detected.
10. The method according to claim 9, characterized in that judging the probability that sensitive data exists in the Web page by using the web page address and page features of the Web page to be detected comprises:
if the Web page is a static page or the Web page contains search engine tag features, determining that the probability that sensitive data exists in the Web page is zero.
11. A device for detecting a Web page, characterized by comprising:
an identification module, for identifying data information of a Web page to be detected and calculating a data degree of the Web page;
a judgment module, for judging whether the data degree is greater than or equal to a preset data degree threshold;
a determination module, for determining that the Web page to be detected is not a sensitive data page if the data degree is less than the preset data degree threshold.
12. Equipment for detecting a Web page, characterized by comprising:
a memory, for storing a computer program;
a processor, for implementing the steps of the method for detecting a Web page according to any one of claims 1 to 10 when executing the computer program.
13. A computer readable storage medium, characterized in that a computer program is stored on the computer readable storage medium, and the computer program, when executed by a processor, implements the steps of the method for detecting a Web page according to any one of claims 1 to 10.
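The keyword stage described in claims 2, 4 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the claimed implementation: the two dictionary entries, their patterns, and the scene checks on length and symbol features are hypothetical, and the sensitivity is taken here as the share of data strings that survive filtering. The keyword weights of claim 3 and the position information of claim 5 are left out.

```python
import re

# Hypothetical dictionary: each entry pairs a loose keyword pattern with a
# semantic-scene check on length and symbol features. Both are assumptions.
SENSITIVE_DICTIONARY = {
    "phone": (
        re.compile(r"\d{11}"),
        lambda s: len(s) == 11 and s.startswith("1"),
    ),
    "email": (
        re.compile(r"@"),
        lambda s: s.count("@") == 1 and "." in s.split("@")[1],
    ),
}


def split_page_data(text):
    """Split page text into data strings on whitespace and common separators."""
    return [part for part in re.split(r"[\s,;|]+", text) if part]


def find_sensitive_keywords(strings):
    """Match each data string against the dictionary, then drop the matches
    that do not fit the semantic scene of the entry they matched."""
    kept = []
    for s in strings:
        for pattern, fits_scene in SENSITIVE_DICTIONARY.values():
            if pattern.search(s) and fits_scene(s):
                kept.append(s)
                break  # count each data string at most once
    return kept


def sensitivity(text):
    """Proportion of data strings that are sensitive keywords."""
    strings = split_page_data(text)
    if not strings:
        return 0.0
    return len(find_sensitive_keywords(strings)) / len(strings)


def is_sensitive_page(text, threshold):
    """Compare the sensitivity with the preset sensitivity threshold."""
    return sensitivity(text) >= threshold
```

In this sketch a 14-digit order number matches the loose phone pattern but fails the length check, so it is filtered out and does not raise the sensitivity of the page.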
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811142835.5A CN109409091B (en) | 2018-09-28 | 2018-09-28 | Method, device and equipment for detecting Web page and computer storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811142835.5A CN109409091B (en) | 2018-09-28 | 2018-09-28 | Method, device and equipment for detecting Web page and computer storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109409091A (en) | 2019-03-01 |
CN109409091B (en) | 2021-11-19 |
Family
ID=65466457
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811142835.5A Active CN109409091B (en) | 2018-09-28 | 2018-09-28 | Method, device and equipment for detecting Web page and computer storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109409091B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115222567A (en) * | 2022-09-20 | 2022-10-21 | 北京润尼尔网络科技有限公司 | VR resource and equipment management system |
CN117390657A (en) * | 2023-12-12 | 2024-01-12 | 深圳竹云科技股份有限公司 | Data encryption method, device, computer equipment and storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105022803A (en) * | 2015-07-01 | 2015-11-04 | 广州市万隆证券咨询顾问有限公司 | Method and system for extracting text content of webpage |
CN106599155A (en) * | 2016-12-07 | 2017-04-26 | 北京亚鸿世纪科技发展有限公司 | Method and system for classifying web pages |
CN106776946A (en) * | 2016-12-02 | 2017-05-31 | 重庆大学 | A kind of detection method of fraudulent website |
CN107943954A (en) * | 2017-11-24 | 2018-04-20 | 杭州安恒信息技术有限公司 | Detection method, device and the electronic equipment of webpage sensitive information |
US20180191735A1 (en) * | 2015-03-20 | 2018-07-05 | Excalibur Ip, Llc | Secure Service for Receiving Sensitive Information through Nested iframes |
CN108304584A (en) * | 2018-03-06 | 2018-07-20 | 百度在线网络技术(北京)有限公司 | Illegal page detection method, apparatus, intruding detection system and storage medium |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180191735A1 (en) * | 2015-03-20 | 2018-07-05 | Excalibur Ip, Llc | Secure Service for Receiving Sensitive Information through Nested iframes |
CN105022803A (en) * | 2015-07-01 | 2015-11-04 | 广州市万隆证券咨询顾问有限公司 | Method and system for extracting text content of webpage |
CN106776946A (en) * | 2016-12-02 | 2017-05-31 | 重庆大学 | A kind of detection method of fraudulent website |
CN106599155A (en) * | 2016-12-07 | 2017-04-26 | 北京亚鸿世纪科技发展有限公司 | Method and system for classifying web pages |
CN107943954A (en) * | 2017-11-24 | 2018-04-20 | 杭州安恒信息技术有限公司 | Detection method, device and the electronic equipment of webpage sensitive information |
CN108304584A (en) * | 2018-03-06 | 2018-07-20 | 百度在线网络技术(北京)有限公司 | Illegal page detection method, apparatus, intruding detection system and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109409091B (en) | 2021-11-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107437038B (en) | Webpage tampering detection method and device | |
WO2019218514A1 (en) | Method for extracting webpage target information, device, and storage medium | |
CN103544255B (en) | Text semantic relativity based network public opinion information analysis method | |
CN103914494B (en) | Method and system for identifying identity of microblog user | |
CN110263248A (en) | A kind of information-pushing method, device, storage medium and server | |
CN109194677A (en) | A kind of SQL injection attack detection, device and equipment | |
CN104573711B (en) | The image understanding method of object and scene based on text objects scene relation | |
CN108021651A (en) | Network public opinion risk assessment method and device | |
CN111241389A (en) | Sensitive word filtering method and device based on matrix, electronic equipment and storage medium | |
CN110197389A (en) | A kind of user identification method and device | |
CN109299085A (en) | A kind of data processing method, electronic equipment and storage medium | |
CN110516034A (en) | Blog management method, device, the network equipment and readable storage medium storing program for executing | |
CN109165529A (en) | A kind of dark chain altering detecting method, device and computer readable storage medium | |
CN111177367A (en) | Case classification method, classification model training method and related products | |
CN107239694A (en) | A kind of Android application permissions inference method and device based on user comment | |
CN107688563B (en) | Synonym recognition method and recognition device | |
CN109299469A (en) | A method of identifying complicated address in long text | |
CN106301979B (en) | Method and system for detecting abnormal channel | |
CN113779481A (en) | Method, device, equipment and storage medium for identifying fraud websites | |
CN109409091A (en) | Detect method, apparatus, equipment and the computer storage medium of Web page | |
CN106168968A (en) | A kind of Website classification method and device | |
CN104036190A (en) | Method and device for detecting page tampering | |
CN114692593A (en) | Network information safety monitoring and early warning method | |
CN110189170A (en) | Market sentiment analysis method and system | |
CN104036189A (en) | Page distortion detecting method and black link database generating method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||