CN109409091A - Method, apparatus, device and computer storage medium for detecting a Web page - Google Patents

Method, apparatus, device and computer storage medium for detecting a Web page

Info

Publication number
CN109409091A
Authority
CN
China
Prior art keywords
web page
data
label
sensitive
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811142835.5A
Other languages
Chinese (zh)
Other versions
CN109409091B (en)
Inventor
林文洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd
Priority to CN201811142835.5A
Publication of CN109409091A
Application granted
Publication of CN109409091B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/034 Test or assess a computer or a system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Abstract

The invention discloses a method, apparatus, device and computer-readable storage medium for detecting a Web page. The method comprises: identifying the data information of a Web page to be detected and calculating the data degree of the Web page; judging whether the data degree is greater than or equal to a preset data degree threshold; and, if the data degree is less than the preset data degree threshold, determining that the Web page is not a sensitive data page. The method, apparatus, device and computer-readable storage medium provided by the invention can improve the efficiency of Web page detection and the credibility of the detection result.

Description

Method, apparatus, device and computer storage medium for detecting a Web page
Technical field
The present invention relates to the technical field of network security, and in particular to a method, apparatus, device and computer-readable storage medium for detecting a Web page.
Background art
With the development of the Internet, the degree of global informatization keeps rising, and the accompanying data security problems are becoming increasingly serious. Data leakage involves personal information, company information and state secrets. Hackers exploit server vulnerabilities to obtain large amounts of sensitive data, which are then used directly or flow into the underground economy, threatening personal property or national security.
In the prior art, sensitive data is mainly identified by keyword dictionary matching combined with manual confirmation. Keyword dictionary matching matches network traffic against user-defined keywords; if the network data matches the keyword dictionary, the network data is regarded as sensitive data.
When sensitive data is identified by keyword dictionary matching, the recognition result depends heavily on the completeness of the dictionary, and the recognition process is easily disturbed by other data, which leads to misjudgments. The accuracy of sensitive data recognition based on keyword dictionary matching is therefore too low.
When sensitive data in network data is identified manually, recognition is slow, a large amount of manpower is required, and the staff must have a certain level of professional expertise. In addition, each staff member relies on subjective judgment when identifying sensitive data, and because the judgment criteria differ from person to person, the recognition results for the same data differ as well.
In summary, how to improve the efficiency of Web page data detection is a problem that currently needs to be solved.
Summary of the invention
The object of the present invention is to provide a method, apparatus, device and computer-readable storage medium for detecting a Web page, so as to solve the problem in the prior art that sensitive data in Web pages is identified inefficiently.
To solve the above technical problem, the present invention provides a method for detecting a Web page, comprising: identifying the data information of a Web page to be detected and calculating the data degree of the Web page; judging whether the data degree is greater than or equal to a preset data degree threshold; and, if the data degree is less than the preset data degree threshold, determining that the Web page is not a sensitive data page.
Preferably, the judging whether the data degree is greater than or equal to the preset data degree threshold further comprises:
if the data degree is greater than or equal to the preset data degree threshold, searching for sensitive keywords in the Web page data according to a pre-constructed sensitive keyword dictionary;
calculating the sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
judging whether the sensitivity is greater than or equal to a preset sensitivity threshold;
if the sensitivity is greater than or equal to the preset sensitivity threshold, determining that the Web page is a sensitive data page.
Preferably, before the searching for sensitive keywords in the Web page data according to the pre-constructed sensitive keyword dictionary, the method further comprises:
collecting the sensitive keywords of various industries into a sensitive data set, and assigning a weight to each sensitive keyword in the sensitive data set according to word frequency and semantics;
classifying the sensitive keywords whose weight is greater than a preset weight threshold, constructing the sensitive keyword dictionary, and setting a semantic scene for each sensitive keyword in the sensitive keyword dictionary.
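The dictionary construction described above can be sketched as follows. This is only an illustrative sketch: the text does not give a weighting formula, so the formula used here (relative word frequency plus a manually supplied semantic bonus), the `DictionaryEntry` fields, the function name `build_dictionary` and the sample keywords are all assumptions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryEntry:
    keyword: str
    category: str
    weight: float
    scene: str  # hypothetical name of the semantic scene rule for this keyword


def build_dictionary(samples, semantic_bonus, categories, scenes, weight_threshold):
    """Build a keyword dictionary from (industry, keyword) samples.

    Assumed formula: weight = relative word frequency + semantic bonus.
    """
    counts = Counter(keyword for _, keyword in samples)
    total = sum(counts.values())
    entries = {}
    for keyword, count in counts.items():
        weight = count / total + semantic_bonus.get(keyword, 0.0)
        # Keep only keywords whose weight exceeds the preset weight threshold.
        if weight > weight_threshold:
            entries[keyword] = DictionaryEntry(
                keyword=keyword,
                category=categories.get(keyword, "uncategorized"),
                weight=weight,
                scene=scenes.get(keyword, "label_value"),
            )
    return entries


samples = [
    ("finance", "amount"), ("finance", "amount"), ("finance", "order number"),
    ("telecom", "phone"), ("telecom", "phone"), ("retail", "color"),
]
dictionary = build_dictionary(
    samples,
    semantic_bonus={"order number": 0.2},
    categories={"amount": "finance", "phone": "contact", "order number": "finance"},
    scenes={},
    weight_threshold=0.3,
)
print(sorted(dictionary))  # ['amount', 'order number', 'phone']
```

In this sample, "color" falls below the threshold and is dropped, while "order number" is kept only because of its semantic bonus, which illustrates how the two weighting factors can interact.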
Preferably, the searching for sensitive keywords in the Web page data according to the pre-constructed sensitive keyword dictionary comprises:
splitting the Web page data to obtain a plurality of data strings, and performing keyword matching on each of the data strings according to the pre-constructed sensitive keyword dictionary;
performing semantic scene filtering on the data strings that successfully match a sensitive keyword in the sensitive keyword dictionary;
filtering out the successfully matched data strings that do not meet the semantic scene, and taking the data strings remaining after filtering as the sensitive keywords in the Web page data.
Preferably, the filtering out the successfully matched data strings that do not meet the semantic scene and taking the data strings remaining after filtering as the sensitive keywords in the Web page data comprises:
analyzing the position information, length information and symbol features of the successfully matched data strings, and judging whether each successfully matched data string meets the semantic scene;
deleting the successfully matched data strings that do not meet the semantic scene, and taking the remaining data strings as the sensitive keywords in the Web page data.
Preferably, the identifying the data information of the Web page to be detected and calculating the data degree of the Web page comprises:
calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationships of the table tags.
Preferably, the calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationships of the table tags comprises:
using the nesting relationships among the table, thead, tbody, tr, th and td tags of the Web page to count the data volume of the Web page, so as to calculate the data degree of the Web page.
Preferably, the using the nesting relationships among the table, thead, tbody, tr, th and td tags of the Web page to count the data volume of the Web page comprises:
using the nesting relationships among the table, thead, tbody, tr, th and td tags of the Web page to restore the distribution of data entities, interference items and outliers in the Web page;
calculating the data volume of each row of the Web page table, and then aggregating the results to obtain the data volume of the Web page table.
Preferably, before the identifying the data information of the Web page to be detected and calculating the data degree of the Web page, the method further comprises:
judging the probability that the Web page contains sensitive data by using the web address and the page features of the Web page to be detected.
Preferably, the judging the probability that the Web page contains sensitive data by using the web address and the page features of the Web page to be detected comprises:
if the Web page is a static page, or the Web page contains search engine tag features, determining that the probability that the Web page contains sensitive data is zero.
The present invention also provides an apparatus for detecting a Web page, comprising:
an identification module, configured to identify the data information of a Web page to be detected and calculate the data degree of the Web page;
a judgment module, configured to judge whether the data degree is greater than or equal to a preset data degree threshold;
a determination module, configured to determine that the Web page to be detected is not a sensitive data page if the data degree is less than the preset data degree threshold.
The present invention also provides a device for detecting a Web page, comprising:
a memory, configured to store a computer program; and a processor, configured to implement the steps of the above method for detecting a Web page when executing the computer program.
The present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above method for detecting a Web page.
The method for detecting a Web page provided by the present invention uses the data information of the Web page to be detected to calculate the data degree of the Web page. When the data degree of the Web page to be detected is small, the Web page contains little data, which greatly reduces the possibility that the Web page is a sensitive data page. The data degree of the Web page is therefore judged against a preset data degree threshold: when the data degree of the Web page is less than the preset data degree threshold, the Web page is judged not to be a sensitive data page. The method introduces the data degree of the Web page as a measure for sensitive data pages and sets the preset data degree threshold as a unified measurement standard, so that Web pages that are not sensitive data pages can be identified quickly. The method thereby addresses two problems of the prior art: detecting whether a Web page is a sensitive data page relies on a large number of professionally trained staff, which makes detection inefficient; and because the subjective judgment criteria of the staff differ, the detection results for the same Web page differ. The method improves both the efficiency of Web page detection and the credibility of the detection result.
Brief description of the drawings
In order to explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a first specific embodiment of the method for detecting a Web page provided by the present invention;
Fig. 2 is a flowchart of a second specific embodiment of the method for detecting a Web page provided by the present invention;
Fig. 3 is a flowchart of a third specific embodiment of the method for detecting a Web page provided by the present invention;
Fig. 4 is a structural block diagram of an apparatus for detecting a Web page provided by an embodiment of the present invention;
Fig. 5 is a structural diagram of a device for detecting a Web page provided by an embodiment of the present invention.
Detailed description of the embodiments
The core of the present invention is to provide a method, apparatus, device and computer-readable storage medium for detecting a Web page that improve the efficiency of Web page detection.
In order to enable those skilled in the art to better understand the solution of the present invention, the present invention is described in further detail below with reference to the drawings and specific embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention rather than all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
Referring to Fig. 1, Fig. 1 is a flowchart of the first specific embodiment of the method for detecting a Web page provided by the present invention. The specific steps are as follows:
Step S101: identify the data information of the Web page to be detected, and calculate the data degree of the Web page;
In this embodiment, the table tags of the Web page can be counted, the data distribution of the Web page can be restored according to the nesting relationships of the table tags, and the statistics can be matched against a predefined data page model, from which the data degree of the Web page is calculated.
The table tags of the Web page include the table, thead, tbody, tr, th and td tags. The data volume of the Web page can be counted through the nesting relationships among these tags. The nesting relationships include the positional relationships of the six tags. For example, it is judged whether the table tag contains a thead tag or a tbody tag; the thead and tbody tags respectively mark the header content and the data part of the table, while the tr, th and td tags describe the individual data entities.
The nesting relationships of the table tags in this embodiment include common ones, such as the positions of the tr, th and td tags within the thead and tbody tags, as well as special ones, such as a td tag that contains a table tag, or a table that has only td tags and no th tags.
According to the nesting relationships of the table tags, the data distribution of the Web page can be restored, so as to determine which parts of the Web page data are data entities, which are interference items and which are outliers. The data volume of each row of the table is calculated row by row, and the results are then aggregated to obtain the data volume of the entire table.
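A minimal sketch of the row-by-row counting described above, using Python's standard `html.parser`. The text does not define the data degree numerically, so several choices here are assumptions: the data degree is taken to be the number of non-empty td cells, th cells are treated as header content rather than data, empty cells are treated as outliers, and the markup is assumed to close its td and table tags explicitly. The names `TableDataCounter` and `data_degree` are illustrative.

```python
from html.parser import HTMLParser


class TableDataCounter(HTMLParser):
    """Counts non-empty <td> cells per <tr>, keeping one row list per table."""

    def __init__(self):
        super().__init__()
        self.tables = []        # finished tables, each a list of per-row counts
        self._open_tables = []  # stack of tables being parsed (handles nesting)
        self._cell_stack = []   # stack of text buffers for open <td> cells

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._open_tables.append([])
        elif tag == "tr" and self._open_tables:
            self._open_tables[-1].append(0)      # start a new row counter
        elif tag == "td" and self._open_tables:
            self._cell_stack.append([])

    def handle_data(self, data):
        # Text belongs to the innermost open <td>; <th> text is never buffered.
        if self._cell_stack:
            self._cell_stack[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell_stack:
            text = "".join(self._cell_stack.pop()).strip()
            # A cell that is empty or only wraps a nested table is not counted.
            if text and self._open_tables and self._open_tables[-1]:
                self._open_tables[-1][-1] += 1
        elif tag == "table" and self._open_tables:
            self.tables.append(self._open_tables.pop())


def data_degree(html):
    """Assumed definition: total number of non-empty data cells in all tables."""
    parser = TableDataCounter()
    parser.feed(html)
    parser.close()
    return sum(sum(rows) for rows in parser.tables)


page = """
<table>
  <thead><tr><th>Name</th><th>Amount</th></tr></thead>
  <tbody>
    <tr><td>Alice</td><td>1234</td></tr>
    <tr><td>Bob</td><td></td></tr>
  </tbody>
</table>
"""
print(data_degree(page))  # 3
```

The table stack is what lets the sketch handle the special case of a td tag that contains a table tag: cells of the inner table are counted against the inner table's rows, and the wrapping cell itself contributes nothing.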
It should be noted that the above data page model, which restores the data distribution, is suitable for calculating the data degree of tabular data. Taking a sensitive data page of a financial reimbursement system as an example, the sensitivity concerns related information such as "amount", "phone" and "order number", while the data degree concerns the corresponding data values in each list.
Step S102: judge whether the data degree is greater than or equal to the preset data degree threshold;
Step S103: if the data degree is less than the preset data degree threshold, determine that the Web page is not a sensitive data page.
In this embodiment, the nesting relationships among the table tags of the Web page are used to restore the distribution of data entities, irrelevant items and interference items in the Web page and to recognize the data of the page as a whole; the result, expressed as the data degree, serves as the basis for detecting sensitive pages. This saves a large amount of human resources and improves the efficiency of detecting the Web page. In addition, the preset data degree threshold serves as a unified measurement standard, which improves the credibility of the detection result for the Web page.
On the basis of the above embodiment, in this embodiment, if the data degree of the Web page is greater than or equal to the preset data degree threshold, the sensitivity of the Web page is calculated and judged, so as to determine whether the Web page is a sensitive data page.
Referring to Fig. 2, Fig. 2 is a flowchart of the second specific embodiment of the method for detecting a Web page provided by the present invention. The specific steps are as follows:
Step S201: calculate the data degree of the Web page to be detected according to the table tags of the Web page and the nesting relationships of the table tags;
Step S202: judge whether the data degree is greater than or equal to the preset data degree threshold;
Step S203: if the data degree is greater than or equal to the preset data degree threshold, search for sensitive keywords in the Web page data according to the pre-constructed sensitive keyword dictionary;
Before the sensitive keywords in the Web page data are searched for according to the pre-constructed sensitive keyword dictionary, the method further includes: collecting the sensitive keywords that exist in various industries into a sensitive data set, and screening and training the sensitive data set; assigning a weight to each sensitive keyword according to word frequency and semantics; and classifying the sensitive keywords whose weight is greater than the preset weight threshold to construct the sensitive keyword dictionary.
When sensitive data in the Web page data to be detected is searched for according to the sensitive keyword dictionary, the Web page data can first be split into a plurality of data strings, and each data string is matched against the sensitive keywords in the sensitive keyword dictionary. Semantic scene filtering is then performed on the successfully matched data strings: according to the position information, length information and symbol features of each successfully matched data string, it is judged whether the data string meets the semantic scene of the corresponding sensitive keyword in the sensitive keyword dictionary. The data strings that do not meet the semantic scene are deleted, and the remaining data strings are taken as the sensitive keywords of the Web page.
Step S204: calculate the sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
Step S205: judge whether the sensitivity is greater than or equal to the preset sensitivity threshold;
Step S206: if the sensitivity is greater than or equal to the preset sensitivity threshold, determine that the Web page is a sensitive data page.
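Steps S202 to S206 can be sketched as a two-stage check. The text only says that the sensitivity follows from the proportion of sensitive keywords in the Web page data, so the concrete definition below (fraction of data strings containing at least one keyword), the separator set used for splitting, and the function names are assumptions; semantic scene filtering is omitted here for brevity.

```python
import re


def split_page_text(text):
    """Split page text into data strings on line breaks and common separators."""
    parts = re.split(r"[\n\r\t|;,]+", text)
    return [p.strip() for p in parts if p.strip()]


def sensitivity(strings, keywords):
    """Assumed definition: fraction of data strings containing a sensitive keyword."""
    if not strings:
        return 0.0
    hits = sum(1 for s in strings if any(k in s for k in keywords))
    return hits / len(strings)


def is_sensitive_page(degree, text, keywords, degree_threshold, sensitivity_threshold):
    """Data degree gate first (S202), then the sensitivity check (S203 to S206)."""
    if degree < degree_threshold:
        return False
    score = sensitivity(split_page_text(text), keywords)
    return score >= sensitivity_threshold


text = "amount: 1234 yuan\nphone: 555-0100\nweather: sunny\nnote: none"
keywords = {"amount", "phone"}
print(sensitivity(split_page_text(text), keywords))   # 0.5
print(is_sensitive_page(10, text, keywords, 5, 0.5))  # True
```

Because the data degree gate runs first, pages with little tabular data never reach the more expensive keyword matching stage.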
In the method for detecting a Web page provided by this embodiment, when the data degree of the Web page to be detected is greater than or equal to the preset data degree threshold, the sensitivity of the Web page is judged. The Web page data is split into a plurality of data strings and matched against the pre-established sensitive keyword dictionary; semantic scene filtering is performed on the data strings that match the dictionary, and the data strings that meet the semantic scene of the corresponding sensitive keyword are taken as the sensitive keywords of the Web page, from which the sensitivity of the Web page is calculated. Compared with the prior art, which judges whether a Web page is a sensitive page by keyword dictionary matching alone, this embodiment finds the sensitive keywords of the Web page more accurately and therefore yields a more accurate sensitivity. On that basis, this embodiment uses both the sensitivity and the data degree to detect whether the Web page is a sensitive data page, which improves the accuracy of the detection result.
On the basis of the above embodiments, in this embodiment the sensitivity of the Web page to be detected can be judged first; when the sensitivity of the Web page is greater than or equal to the preset sensitivity threshold, the data degree of the Web page is then judged, so as to determine whether the Web page is a sensitive data page. Before the sensitivity or the data degree of the Web page is calculated, pages that obviously contain no sensitive data can be excluded according to the web address and the page features of the Web page to be detected, which improves detection efficiency and detection accuracy.
Referring to Fig. 3, Fig. 3 is a flowchart of the third specific embodiment of the method for detecting a Web page provided by the present invention. The specific steps are as follows:
Step S301: judge whether the probability that the Web page to be detected contains sensitive data is zero;
In this embodiment, the web address and the page features can be used to pre-filter the Web page to be detected and exclude Web pages that obviously contain no sensitive data, so as to improve detection efficiency and detection accuracy. For example, if the Web page is a static page or contains search engine tag features, the Web page cannot be a sensitive data page.
When the address of the Web page points to a static page, such as an HTML page, the probability that the Web page contains sensitive data is zero, so sensitive data detection does not need to be performed on the Web page.
When the Web page contains certain search engine tag features, such as keywords, sensitive data detection is not performed on the Web page.
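The pre-filter of step S301 might look like the sketch below. The text names only "static page, such as HTML" and "search engine tag features, such as keywords"; interpreting these as a URL path suffix check and a `<meta name="keywords">` tag check is an assumption, as are the suffix list and the function name `may_contain_sensitive_data`.

```python
import re
from urllib.parse import urlparse

# Assumed list of suffixes that mark a static page.
STATIC_SUFFIXES = (".html", ".htm", ".shtml")

# Assumed search engine tag feature: a <meta name="keywords" ...> tag.
META_KEYWORDS = re.compile(
    r"<meta\s[^>]*name\s*=\s*[\"']keywords[\"']", re.IGNORECASE
)


def may_contain_sensitive_data(url, html):
    """Return False when the probability of sensitive data is treated as zero."""
    path = urlparse(url).path.lower()
    if path.endswith(STATIC_SUFFIXES):   # static page by web address
        return False
    if META_KEYWORDS.search(html):       # page written for search engines
        return False
    return True


print(may_contain_sensitive_data("https://example.com/index.html", ""))  # False
print(may_contain_sensitive_data("https://example.com/report?id=7",
                                 "<table></table>"))                     # True
```

Only the path component of the address is inspected, so a query string such as `?file=a.html` does not cause a dynamic page to be skipped.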
Step S302: if the probability that the Web page contains sensitive data is not zero, search for sensitive keywords in the Web page data to be detected according to the pre-constructed sensitive keyword dictionary;
Step S303: filter out the sensitive keywords that do not meet the semantic scene, and take the sensitive keywords remaining after filtering as the sensitive keywords obtained from the Web page;
After matching source extraction, keyword pattern matching and semantic scene analysis have been performed on the Web page data, the sensitivity of the Web page is calculated.
Matching source extraction corresponds to the operation in the above embodiment of splitting the Web page data into a plurality of data strings, that is, into a plurality of matching sources. Keyword pattern matching means performing keyword matching on the data strings according to the sensitive keyword dictionary; semantic scene analysis is then performed on the successfully matched data strings to obtain the sensitive keywords of the Web page.
Semantic scene filtering of the successfully matched data strings in the Web page analyzes whether each string conforms to Chinese semantics, based on the keyword position, the length of the matched data, and special symbol and digit features. For example, the extracted data string "amount: 1234 yuan" meets the semantic scene defined for the sensitive keyword "amount", whereas the data string "what is the amount payable" does not.
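The example above can be turned into a small heuristic. This is a sketch of one possible "label followed by value" scene rule, not the rule used by the invention: the position limit, the length limit and the separator set are assumed values, and `meets_label_value_scene` is an illustrative name.

```python
import re


def meets_label_value_scene(data_string, keyword, max_length=40):
    """Heuristic scene check: keyword near the start, then a separator and a digit."""
    position = data_string.find(keyword)
    if position < 0 or position > 4:       # keyword position feature
        return False
    if len(data_string) > max_length:      # matched data length feature
        return False
    rest = data_string[position + len(keyword):]
    # Symbol and digit features: a colon or equals sign followed by a value
    # that contains at least one digit.
    return re.match(r"\s*[:：=]\s*\S*\d", rest) is not None


def filter_matches(matches):
    """Keep only the (data_string, keyword) pairs that meet the scene."""
    return [(s, k) for s, k in matches if meets_label_value_scene(s, k)]


matches = [
    ("amount: 1234 yuan", "amount"),
    ("what is the amount payable", "amount"),
]
print(filter_matches(matches))  # [('amount: 1234 yuan', 'amount')]
```

The second string is rejected on position alone, since the keyword appears in the middle of a question rather than as a field label.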
Step S304: calculate the sensitivity of the Web page according to the proportion of the sensitive keywords of the Web page in the Web page data;
Step S305: judge whether the sensitivity is greater than or equal to the preset sensitivity threshold;
Step S306: if the sensitivity is greater than or equal to the preset sensitivity threshold, calculate the data degree of the Web page according to the table tags of the Web page and the nesting relationships of the table tags;
Step S307: judge whether the data degree is greater than or equal to the preset data degree threshold;
Step S308: if the data degree is greater than or equal to the preset data degree threshold, determine that the Web page is a sensitive data page.
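The ordering of steps S301 to S308 can be sketched as a short pipeline. To keep the sketch self-contained, the three stages are passed in as callables; the parameter names and the stub lambdas in the usage example are hypothetical placeholders, not part of the described method.

```python
def detect_sensitive_page(url, html, *, prefilter, compute_sensitivity,
                          compute_data_degree, sensitivity_threshold,
                          degree_threshold):
    """Third embodiment: pre-filter, then sensitivity, then data degree."""
    # S301: skip pages whose probability of holding sensitive data is zero.
    if not prefilter(url, html):
        return False
    # S302 to S305: keyword matching, scene filtering and the sensitivity check.
    if compute_sensitivity(html) < sensitivity_threshold:
        return False
    # S306 to S308: the data degree is computed only for pages that got this far.
    return compute_data_degree(html) >= degree_threshold


verdict = detect_sensitive_page(
    "https://example.com/report?id=7",
    "<table>...</table>",
    prefilter=lambda url, html: not url.endswith(".html"),  # stub
    compute_sensitivity=lambda html: 0.6,                   # stub
    compute_data_degree=lambda html: 12,                    # stub
    sensitivity_threshold=0.5,
    degree_threshold=10,
)
print(verdict)  # True
```

Each stage returns early, so the table parsing needed for the data degree is never run for pages that are pre-filtered or that fall below the sensitivity threshold.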
In this embodiment, the Web page to be detected is first pre-filtered according to page features such as its web address and page tags. After Web pages that obviously contain no sensitive data have been excluded, matching and semantic scene analysis are performed on the Web page data according to the pre-established sensitive keyword dictionary, from which the sensitivity of the Web page is calculated. When the sensitivity is greater than or equal to the preset sensitivity threshold, the nesting relationships among the table tags are used to restore the distribution of data entities, irrelevant items and interference items in the Web page and to recognize the data of the page as a whole; the result, expressed as a data volume, serves as a basis for detecting sensitive pages. This embodiment thus combines the matching information obtained from Chinese semantic analysis with the information obtained from data page recognition, that is, the magnitudes and weights of the sensitivity and the data degree of the Web page, and judges on that basis whether the Web page is a sensitive data page. This greatly saves human resources and improves both the efficiency of Web page detection and the accuracy of the detection result.
Referring to Fig. 4, Fig. 4 is a structural block diagram of an apparatus for detecting a Web page provided by an embodiment of the present invention. The apparatus may specifically include:
an identification module 100, configured to identify the data information of a Web page to be detected and calculate the data degree of the Web page;
a judgment module 200, configured to judge whether the data degree is greater than or equal to a preset data degree threshold;
a determination module 300, configured to determine that the Web page to be detected is not a sensitive data page if the data degree is less than the preset data degree threshold.
The apparatus for detecting a Web page of this embodiment is used to implement the foregoing method for detecting a Web page, so its specific implementation can be found in the method embodiments above. For example, the identification module 100, the judgment module 200 and the determination module 300 are respectively used to implement steps S101, S102 and S103 of the above method. Their specific implementation can therefore refer to the descriptions of the corresponding embodiments and is not repeated here.
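One way to picture the three modules is as small composable objects, each mapping to one of steps S101 to S103. The class and method names below are illustrative only, and returning `None` for "no decision yet" when the threshold is met is an assumption: the first embodiment only specifies the outcome for pages below the threshold.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class IdentificationModule:          # module 100, step S101
    compute_data_degree: Callable[[str], float]

    def identify(self, html: str) -> float:
        return self.compute_data_degree(html)


@dataclass
class JudgmentModule:                # module 200, step S102
    degree_threshold: float

    def meets_threshold(self, degree: float) -> bool:
        return degree >= self.degree_threshold


class DeterminationModule:           # module 300, step S103
    def determine(self, meets_threshold: bool) -> Optional[bool]:
        # Below the threshold: not a sensitive data page (False).
        # Otherwise this sketch leaves the page undecided (None).
        return None if meets_threshold else False


@dataclass
class WebPageDetector:
    identification: IdentificationModule
    judgment: JudgmentModule
    determination: DeterminationModule

    def detect(self, html: str) -> Optional[bool]:
        degree = self.identification.identify(html)
        return self.determination.determine(self.judgment.meets_threshold(degree))


detector = WebPageDetector(
    IdentificationModule(lambda html: html.count("<td>")),  # stub data degree
    JudgmentModule(degree_threshold=5),
    DeterminationModule(),
)
print(detector.detect("<td>a</td>"))  # False
```

A page that meets the threshold yields `None` here, which leaves room for the sensitivity check of the second embodiment to make the final decision.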
Referring to Fig. 5, Fig. 5 is a structural diagram of a device for detecting a Web page provided by an embodiment of the present invention. The device may specifically include:
a memory 100, configured to store a computer program;
a processor 200, configured to implement the steps of the above method for detecting a Web page when executing the computer program.
A specific embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above method for detecting a Web page.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may implement the described functions in different ways for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The method, device, equipment and computer-readable storage medium for detecting a Web page provided by the present invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. It should be noted that those of ordinary skill in the art can make several improvements and modifications to the present invention without departing from its principle, and these improvements and modifications also fall within the scope of protection of the claims of the present invention.

Claims (13)

1. A method for detecting a Web page, characterized by comprising:
identifying data information of a Web page to be detected, and calculating a data degree of the Web page;
judging whether the data degree is greater than or equal to a preset data degree threshold;
if the data degree is less than the preset data degree threshold, determining that the Web page is not a sensitive data page.
2. The method according to claim 1, characterized in that judging whether the data degree is greater than or equal to the preset data degree threshold further comprises:
if the data degree is greater than or equal to the preset data degree threshold, searching for sensitive keywords in the Web page data according to a sensitive keyword dictionary constructed in advance;
calculating the sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
judging whether the sensitivity is greater than or equal to a preset sensitivity threshold;
if the sensitivity is greater than or equal to the preset sensitivity threshold, determining that the Web page is a sensitive data page.
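As a rough illustration of the proportion-based sensitivity in this claim, the sketch below divides the number of data strings that are sensitive keywords by the total number of data strings. The function name `keyword_sensitivity` and the unweighted ratio are assumptions; the source does not state the exact formula.

```python
from typing import Iterable, Set


def keyword_sensitivity(data_strings: Iterable[str],
                        sensitive_keywords: Set[str]) -> float:
    """Return the share of data strings that are sensitive keywords (0.0 to 1.0)."""
    strings = list(data_strings)
    if not strings:
        # An empty page carries no sensitive keywords.
        return 0.0
    hits = sum(1 for s in strings if s in sensitive_keywords)
    return hits / len(strings)
```

For example, a page whose data strings are `["name", "phone", "hello", "world"]`, checked against the keywords `{"name", "phone"}`, would score 0.5, which is then compared with the preset sensitivity threshold.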
3. The method according to claim 2, characterized in that, before searching for the sensitive keywords in the Web page data according to the sensitive keyword dictionary constructed in advance, the method further comprises:
collecting sensitive keywords of various industries to form a sensitive data set, and assigning a weight to each sensitive keyword in the sensitive data set according to word frequency and semantics;
after classifying the sensitive keywords whose weights are greater than a preset weight threshold, constructing the sensitive keyword dictionary and setting a semantic scene for each sensitive keyword in the sensitive keyword dictionary.
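One possible way to build such a dictionary is sketched below. The weight formula (relative word frequency multiplied by a hand-assigned semantic factor), the function name `build_dictionary` and all parameter names are assumptions made for illustration; the source only says that weights depend on word frequency and semantics.

```python
from collections import Counter
from typing import Dict, Iterable


def build_dictionary(samples: Iterable[str],
                     semantic_factor: Dict[str, float],
                     categories: Dict[str, str],
                     scenes: Dict[str, str],
                     weight_threshold: float) -> Dict[str, Dict[str, dict]]:
    """Group keywords whose weight exceeds the threshold by category."""
    counts = Counter(samples)
    total = sum(counts.values())
    dictionary: Dict[str, Dict[str, dict]] = {}
    for word, n in counts.items():
        # Hypothetical weight: relative frequency scaled by a semantic factor.
        weight = (n / total) * semantic_factor.get(word, 1.0)
        if weight > weight_threshold:
            category = categories.get(word, "uncategorized")
            dictionary.setdefault(category, {})[word] = {
                "weight": weight,
                "scene": scenes.get(word, "any"),  # semantic scene label
            }
    return dictionary
```

Keeping the semantic scene next to each keyword lets the later matching step look up the scene rule without a second table.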
4. The method according to claim 3, characterized in that searching for the sensitive keywords in the Web page data according to the sensitive keyword dictionary constructed in advance comprises:
splitting the Web page data to obtain a plurality of data strings, and performing keyword matching on each of the data strings according to the sensitive keyword dictionary constructed in advance;
performing semantic scene filtering on the data strings that successfully match sensitive keywords in the sensitive keyword dictionary;
filtering out, from the successfully matched data strings, the data strings that do not meet the semantic scene, and taking the data strings remaining after filtering as the sensitive keywords in the Web page data.
5. The method according to claim 4, characterized in that filtering out, from the successfully matched data strings, the data strings that do not meet the semantic scene, and taking the data strings remaining after filtering as the sensitive keywords in the Web page data comprises:
analyzing the position information, length information and symbol features of the successfully matched data strings, and judging whether the successfully matched data strings meet the semantic scene;
after deleting, from the successfully matched data strings, the data strings that do not meet the semantic scene, taking the remaining data strings as the sensitive keywords in the Web page data.
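The splitting, matching and semantic scene filtering of claims 4 and 5 might look like the sketch below. The concrete scene rules are hypothetical stand-ins for the position, length and symbol analysis: a matched string is kept only if the keyword sits at the start of the string, the string is short, and it contains no sentence punctuation. The function name and default values are likewise assumptions.

```python
import re
from typing import List, Sequence


def find_sensitive_strings(page_text: str, keywords: Sequence[str],
                           max_len: int = 20,
                           forbidden: str = "?!") -> List[str]:
    """Split page text into data strings and keep matches that fit the scene."""
    # Split on line breaks, tabs and pipes, dropping empty fragments.
    strings = [s.strip() for s in re.split(r"[\n\t|]+", page_text) if s.strip()]
    kept = []
    for s in strings:
        lowered = s.lower()
        matched = next((k for k in keywords if k in lowered), None)
        if matched is None:
            continue  # no dictionary keyword in this data string
        at_start = lowered.find(matched) == 0          # position feature
        short_enough = len(s) <= max_len               # length feature
        clean = not any(ch in forbidden for ch in s)   # symbol feature
        if at_start and short_enough and clean:
            kept.append(s)
    return kept
```

With these rules a table label such as `Phone` is kept, while a prose sentence such as `Did you forget your phone password?` is filtered out even though it contains the same keyword.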
6. The method according to claim 1, characterized in that identifying the data information of the Web page to be detected and calculating the data degree of the Web page comprises:
calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationship of the table tags.
7. The method according to claim 6, characterized in that calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationship of the table tags comprises:
counting the data volume of the Web page by using the nesting relationship between the table tag, thead tag, tbody tag, tr tag, th tag and td tag of the Web page, so as to calculate the data degree of the Web page.
8. The method according to claim 7, characterized in that counting the data volume of the Web page by using the nesting relationship between the table tag, thead tag, tbody tag, tr tag, th tag and td tag of the Web page comprises:
restoring the distribution of data entities, interference items and irrelevant items in the Web page by using the nesting relationship between the table tag, thead tag, tbody tag, tr tag, th tag and td tag of the Web page;
after separately calculating the data volume of each row of the Web page table, computing the total data volume of the Web page table.
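A simplified sketch of the per-row counting in claims 7 and 8 is given below, using Python's standard `html.parser`. It assumes that non-empty `td` cells are data entities, that `th` cells are headers rather than data, and that empty cells are irrelevant items; these interpretations, the class name `TableDataCounter` and the function `table_data_volume` are all assumptions. Tables nested inside cells are not handled.

```python
from html.parser import HTMLParser
from typing import List


class TableDataCounter(HTMLParser):
    """Count non-empty <td> cells for every <tr> found inside a <table>."""

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0              # how many <table> elements are open
        self.rows: List[int] = []   # non-empty <td> count per <tr>
        self.in_td = False
        self.cell_text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.depth += 1
        elif tag == "tr" and self.depth:
            self.rows.append(0)
        elif tag == "td" and self.rows:
            self.in_td = True
            self.cell_text = []

    def handle_data(self, data):
        if self.in_td:
            self.cell_text.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.in_td:
            if "".join(self.cell_text).strip():
                self.rows[-1] += 1  # count only cells that hold text
            self.in_td = False
        elif tag == "table" and self.depth:
            self.depth -= 1


def table_data_volume(html: str) -> int:
    """Sum the per-row data volumes to get the page's table data volume."""
    parser = TableDataCounter()
    parser.feed(html)
    parser.close()
    return sum(parser.rows)
```

The resulting total would then be compared with the preset data degree threshold; how the data volume maps to the data degree is not specified in the source, so the sketch uses the raw count.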
9. The method according to claim 1, characterized in that, before identifying the data information of the Web page to be detected and calculating the data degree of the Web page, the method further comprises:
judging the probability that sensitive data exists in the Web page by using the page URL and page features of the Web page to be detected.
10. The method according to claim 9, characterized in that judging the probability that sensitive data exists in the Web page by using the page URL and page features of the Web page to be detected comprises:
if the Web page is a static page or the Web page contains search engine tag features, determining that the probability that sensitive data exists in the Web page is zero.
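The pre-filter of claims 9 and 10 could be approximated as follows. Both heuristics are assumptions, since the source does not say how a static page or a search engine tag feature is recognized: here a page counts as static when its URL path ends in an HTML suffix and has no query string, and `meta` tags named `keywords` or `description` are treated as search engine tag features.

```python
import re
from urllib.parse import urlparse

# Hypothetical suffix list for static pages.
STATIC_SUFFIXES = (".html", ".htm", ".shtml")

# Hypothetical marker for search-engine-oriented meta tags.
SEO_META = re.compile(r'''<meta[^>]+name=["'](keywords|description)["']''',
                      re.IGNORECASE)


def sensitive_probability_is_zero(url: str, html: str) -> bool:
    """Return True when the page can be skipped by the pre-filter."""
    parsed = urlparse(url)
    is_static = (parsed.path.lower().endswith(STATIC_SUFFIXES)
                 and not parsed.query)
    has_seo_tags = SEO_META.search(html) is not None
    return is_static or has_seo_tags
```

Pages for which this returns `True` would be excluded before any data degree or sensitivity is calculated.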
11. A device for detecting a Web page, characterized by comprising:
an identification module, configured to identify data information of a Web page to be detected and calculate a data degree of the Web page;
a judgment module, configured to judge whether the data degree is greater than or equal to a preset data degree threshold;
a determination module, configured to determine that the Web page to be detected is not a sensitive data page if the data degree is less than the preset data degree threshold.
12. Equipment for detecting a Web page, characterized by comprising:
a memory, configured to store a computer program;
a processor, configured to implement the steps of the method for detecting a Web page according to any one of claims 1 to 10 when executing the computer program.
13. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the method for detecting a Web page according to any one of claims 1 to 10 are implemented.
CN201811142835.5A 2018-09-28 2018-09-28 Method, device and equipment for detecting Web page and computer storage medium Active CN109409091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811142835.5A CN109409091B (en) 2018-09-28 2018-09-28 Method, device and equipment for detecting Web page and computer storage medium

Publications (2)

Publication Number Publication Date
CN109409091A true CN109409091A (en) 2019-03-01
CN109409091B CN109409091B (en) 2021-11-19

Family

ID=65466457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811142835.5A Active CN109409091B (en) 2018-09-28 2018-09-28 Method, device and equipment for detecting Web page and computer storage medium

Country Status (1)

Country Link
CN (1) CN109409091B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115222567A (en) * 2022-09-20 2022-10-21 北京润尼尔网络科技有限公司 VR resource and equipment management system
CN117390657A (en) * 2023-12-12 2024-01-12 深圳竹云科技股份有限公司 Data encryption method, device, computer equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022803A (en) * 2015-07-01 2015-11-04 广州市万隆证券咨询顾问有限公司 Method and system for extracting text content of webpage
CN106599155A (en) * 2016-12-07 2017-04-26 北京亚鸿世纪科技发展有限公司 Method and system for classifying web pages
CN106776946A (en) * 2016-12-02 2017-05-31 重庆大学 A kind of detection method of fraudulent website
CN107943954A (en) * 2017-11-24 2018-04-20 杭州安恒信息技术有限公司 Detection method, device and the electronic equipment of webpage sensitive information
US20180191735A1 (en) * 2015-03-20 2018-07-05 Excalibur Ip, Llc Secure Service for Receiving Sensitive Information through Nested iframes
CN108304584A (en) * 2018-03-06 2018-07-20 百度在线网络技术(北京)有限公司 Illegal page detection method, apparatus, intruding detection system and storage medium

Also Published As

Publication number Publication date
CN109409091B (en) 2021-11-19

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant