CN109409091B - Method, device and equipment for detecting Web page and computer storage medium - Google Patents

Info

Publication number
CN109409091B
Authority
CN
China
Prior art keywords
data
web page
sensitive
degree
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811142835.5A
Other languages
Chinese (zh)
Other versions
CN109409091A (en)
Inventor
林文洲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sangfor Technologies Co Ltd
Original Assignee
Sangfor Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sangfor Technologies Co Ltd filed Critical Sangfor Technologies Co Ltd
Priority to CN201811142835.5A priority Critical patent/CN109409091B/en
Publication of CN109409091A publication Critical patent/CN109409091A/en
Application granted granted Critical
Publication of CN109409091B publication Critical patent/CN109409091B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/57 Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
    • G06F21/577 Assessing vulnerabilities and evaluating computer system security
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/62 Protecting access to data via a platform, e.g. using keys or access control rules
    • G06F21/6218 Protecting access to data via a platform, e.g. using keys or access control rules to a system of files or objects, e.g. local or distributed file system or database
    • G06F21/6245 Protecting personal data, e.g. for financial or medical purposes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03 Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/034 Test or assess a computer or a system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00 Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/21 Indexing scheme relating to G06F21/00 and subgroups addressing additional information or applications relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/2119 Authenticating web pages, e.g. with suspicious links

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Bioethics (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method, a device, equipment and a computer readable storage medium for detecting a Web page, which comprises the following steps: identifying data information of a Web page to be detected, and calculating the data degree of the Web page; judging whether the data degree is greater than or equal to a preset data degree threshold value or not; and if the data degree is smaller than the preset data degree threshold value, judging that the Web page is not a sensitive data page. The method, the device, the equipment and the computer readable storage medium provided by the invention can improve the detection efficiency of the Web page and the reliability of the detection result.

Description

Method, device and equipment for detecting Web page and computer storage medium
Technical Field
The present invention relates to the field of network security technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for detecting a Web page.
Background
With the development of the Internet, the world is becoming increasingly information-driven, and the accompanying data security problems are growing more serious. Leaked data may include personal information, enterprise information and state secrets; hackers can exploit server vulnerabilities to obtain large quantities of sensitive data, which they then use directly or feed into the underground (black-market) economy, threatening personal property and even national security.
In the prior art, sensitive data is mainly identified by combining keyword dictionary matching with manual confirmation. Keyword dictionary matching checks network traffic against user-defined keywords; if the network data matches the keyword dictionary, it is considered sensitive data.
When keyword dictionary matching is used to identify sensitive data, the result depends heavily on how complete the dictionary library is, and interference from other data during identification causes misjudgments; the accuracy of results obtained by keyword dictionary matching is therefore low.
Identifying sensitive data in network data manually is slow, requires a large investment of manpower, and places certain demands on the professional competence of the staff. Moreover, each worker's judgment of sensitive data is subjective and the criteria differ from worker to worker, so the identification results for the same data are inconsistent.
In summary, how to improve the efficiency of Web page data detection is a problem that currently needs to be solved.
Disclosure of Invention
The invention aims to provide a method, a device and equipment for detecting a Web page and a computer readable storage medium, so as to solve the problem that the identification efficiency of sensitive data of the Web page is low in the prior art.
In order to solve the above technical problem, the present invention provides a method for detecting a Web page, including: identifying data information of a Web page to be detected, and calculating the data degree of the Web page; judging whether the data degree is greater than or equal to a preset data degree threshold value or not; and if the data degree is smaller than the preset data degree threshold value, judging that the Web page is not a sensitive data page.
Preferably, the determining whether the data degree is greater than or equal to a preset data degree threshold further includes:
if the data degree is greater than or equal to the preset data degree threshold value, searching for the sensitive keywords in the Web page data according to a pre-constructed sensitive keyword dictionary library;
calculating the sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
judging whether the sensitivity is greater than or equal to a preset sensitivity threshold value or not;
and if the sensitivity is greater than or equal to the preset sensitivity threshold, judging that the Web page is a sensitive data page.
Preferably, before searching for the sensitive keyword in the Web page data according to a pre-constructed sensitive keyword dictionary library, the method further includes:
sensitive keywords of each industry are collected to form a sensitive data set, and weights are distributed to the sensitive keywords in the sensitive data set according to word frequency and semantics;
classifying the sensitive keywords with the weight larger than a preset weight threshold, constructing a sensitive keyword dictionary library, and setting a semantic scene for each sensitive keyword in the sensitive keyword dictionary library.
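The dictionary construction described above can be sketched as follows. This is only an illustration under stated assumptions: the source does not give a weighting formula, so the weight here is assumed to be the relative word frequency multiplied by a hand-assigned semantic factor, and the semantic scene is assumed to be stored as a regular-expression pattern. The names `build_dictionary`, `semantic_factor`, `categories` and `scenes` are hypothetical.

```python
from collections import Counter

def build_dictionary(corpus_terms, semantic_factor, categories, scenes, weight_threshold):
    """Return {keyword: {"weight", "category", "scene"}} for keywords above the threshold.

    corpus_terms: candidate sensitive keywords as they occur in the collected samples.
    semantic_factor: hypothetical per-keyword multiplier standing in for "semantics".
    """
    counts = Counter(corpus_terms)
    total = sum(counts.values())
    dictionary = {}
    for word, count in counts.items():
        # Assumed weighting: relative frequency scaled by a semantic factor.
        weight = (count / total) * semantic_factor.get(word, 1.0)
        if weight > weight_threshold:
            dictionary[word] = {
                "weight": weight,
                "category": categories.get(word, "uncategorized"),
                "scene": scenes.get(word),  # assumed representation: a regex pattern
            }
    return dictionary

terms = ["amount", "amount", "phone", "amount", "name", "phone"]
sensitive_dict = build_dictionary(
    terms,
    semantic_factor={"amount": 1.0, "phone": 1.5, "name": 0.5},
    categories={"amount": "finance", "phone": "personal"},
    scenes={"amount": r"amount\s*:\s*\d+", "phone": r"phone\s*:\s*\d{7,}"},
    weight_threshold=0.2,
)
```

With these sample inputs, "name" falls below the weight threshold and is left out of the dictionary, while "amount" and "phone" are kept together with their category and scene.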
Preferably, the searching for the sensitive keyword in the Web page data according to a pre-constructed sensitive keyword dictionary library includes:
segmenting the Web page data to obtain a plurality of data strings, and respectively performing keyword matching on the data strings according to a pre-constructed sensitive keyword dictionary library;
performing semantic scene filtering on the data strings successfully matched with the sensitive keywords in the sensitive keyword dictionary library;
and filtering the data strings which do not accord with the semantic scene in the successfully matched data strings, and taking the remaining filtered data strings as sensitive keywords in the Web page data.
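A minimal sketch of the three steps above (segmentation, keyword matching, semantic scene filtering) is given below. The separators used for segmentation and the function names are assumptions, and the semantic scene check is passed in as a caller-supplied predicate because the source does not fix its form; the placeholder predicate shown here is purely illustrative.

```python
import re

def split_into_data_strings(page_text):
    """Segment page text into data strings (assumed separators: line breaks, tabs, ';', '|')."""
    parts = re.split(r"[\n\r\t;|]+", page_text)
    return [p.strip() for p in parts if p.strip()]

def find_sensitive_keywords(page_text, dictionary, conforms_to_scene):
    """Return (data_string, keyword) pairs that match a keyword and pass the scene filter.

    dictionary: iterable of sensitive keywords.
    conforms_to_scene: callable (data_string, keyword) -> bool, supplied by the caller.
    """
    kept = []
    for data_string in split_into_data_strings(page_text):
        for keyword in dictionary:
            # Keyword matching first, then semantic scene filtering on the matches only.
            if keyword in data_string and conforms_to_scene(data_string, keyword):
                kept.append((data_string, keyword))
    return kept

def keyword_followed_by_colon(data_string, keyword):
    """Placeholder scene rule (an assumption): the keyword must be followed by a colon."""
    return re.search(re.escape(keyword) + r"\s*:", data_string) is not None

text = "amount: 1234 yuan\nplease enter the amount\nphone: 13800000000"
result = find_sensitive_keywords(text, ["amount", "phone"], keyword_followed_by_colon)
```

Here the second data string matches the keyword "amount" but is dropped by the scene filter, so only the first and third data strings remain.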
Preferably, the filtering the data strings in the successfully matched data strings that do not conform to the semantic scene, and taking the remaining filtered data strings as sensitive keywords in the Web page data includes:
analyzing the position information, the length information and the symbolic characteristics of the successfully matched data string, and judging whether the successfully matched data string conforms to a semantic scene;
and after deleting the data strings which do not accord with the semantic scene in the successfully matched data strings, taking the rest data strings as sensitive keywords in the Web page data.
Preferably, the identifying the data information of the Web page to be detected, and the calculating the data degree of the Web page includes:
and calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationship of the table tags.
Preferably, the calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationship of the table tags includes:
and counting the data size of the Web page by using the nesting relationship among the table tag, the thead tag, the tbody tag, the tr tag, the th tag and the td tag of the Web page, so as to calculate the data degree of the Web page.
Preferably, the counting the data size of the Web page by using the nesting relationship among the table tag, the thead tag, the tbody tag, the tr tag, the th tag and the td tag of the Web page includes:
restoring the data distribution of the data entities, the interference items and the irrelevant items in the Web page by using the nesting relationship among the table tag, the thead tag, the tbody tag, the tr tag, the th tag and the td tag of the Web page;
and calculating the data volume of each row of the Web page table separately, and then totaling the data volume of the Web page table.
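The row-by-row counting can be illustrated with Python's standard `html.parser` module. The sketch below rests on several assumptions that the source does not state: a non-empty td cell counts as one data entity, th cells are treated as header items and not counted, the data degree is taken to be the sum of the row volumes, and the markup is well formed (closing tags present). The class and function names are hypothetical.

```python
from html.parser import HTMLParser

class TableDataCounter(HTMLParser):
    """Counts non-empty <td> cells per <tr>; <th> cells are not counted."""

    def __init__(self):
        super().__init__()
        self.row_volumes = []   # data volume of each completed row
        self._row_stack = []    # counters for open <tr> elements (supports nested tables)
        self._cell_stack = []   # text buffers for open <td> elements

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row_stack.append(0)
        elif tag == "td":
            self._cell_stack.append([])

    def handle_data(self, data):
        # Text is attributed to the innermost open <td> only.
        if self._cell_stack:
            self._cell_stack[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell_stack:
            text = "".join(self._cell_stack.pop()).strip()
            if text and self._row_stack:
                self._row_stack[-1] += 1
        elif tag == "tr" and self._row_stack:
            self.row_volumes.append(self._row_stack.pop())

def data_degree(html):
    """Assumed definition: total number of non-empty <td> cells over all rows."""
    counter = TableDataCounter()
    counter.feed(html)
    counter.close()
    return sum(counter.row_volumes)

sample = """
<table>
  <thead><tr><th>Name</th><th>Amount</th></tr></thead>
  <tbody>
    <tr><td>Alice</td><td>1234</td></tr>
    <tr><td>Bob</td><td></td></tr>
  </tbody>
</table>
"""
degree = data_degree(sample)
```

A td cell that only wraps a nested table contributes nothing itself, because its own text buffer stays empty; the cells of the inner table are counted in the inner rows.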
Preferably, the identifying the data information of the Web page to be detected and calculating the data degree of the Web page further includes:
and judging the probability of sensitive data of the Web page by utilizing the webpage address and the page characteristics of the Web page to be detected.
Preferably, the determining, by using the Web page address and the page feature of the Web page to be detected, the probability that the Web page has sensitive data includes:
and if the Web page is a static page or the Web page contains the characteristics of the search engine tag, judging that the probability of sensitive data in the Web page is zero.
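The pre-filter can be sketched as below. The concrete rules are assumptions made for illustration: a page is treated as static when its URL path ends in `.html` or `.htm`, and a `<meta name="keywords">` element is used as the search engine tag feature. The source names neither rule precisely, and the function names are hypothetical.

```python
import re
from urllib.parse import urlparse

STATIC_SUFFIXES = (".html", ".htm")  # assumed markers of a static page

def is_static_page(url):
    """True if the URL path ends with one of the assumed static suffixes."""
    path = urlparse(url).path.lower()
    return path.endswith(STATIC_SUFFIXES)

def has_search_engine_tag(html):
    """True if the page has a <meta name="keywords" ...> element (assumed feature)."""
    pattern = r'<meta\s+[^>]*name\s*=\s*["\']?keywords\b'
    return re.search(pattern, html, re.IGNORECASE) is not None

def sensitive_probability_is_zero(url, html):
    """Pre-filter: static pages and pages with search engine tags are skipped."""
    return is_static_page(url) or has_search_engine_tag(html)
```

For example, `sensitive_probability_is_zero("http://example.com/report.php?id=1", "<table></table>")` evaluates to `False`, so such a page would go on to the later detection steps.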
The invention also provides a device for detecting the Web page, which comprises the following components:
the identification module is used for identifying the data information of the Web page to be detected and calculating the data degree of the Web page;
the judging module is used for judging whether the data degree is greater than or equal to a preset data degree threshold value or not;
and the determining module is used for determining that the Web page to be detected is not a sensitive data page if the data degree is smaller than the preset data degree threshold.
The invention also provides equipment for detecting the Web page, which comprises the following components:
a memory for storing a computer program; a processor for implementing the steps of the method for detecting Web pages described above when executing the computer program.
The invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of a method of detecting Web pages as described above.
The method for detecting a Web page provided by the invention uses the data information of the Web page to be detected to calculate the data degree of the Web page. A small data degree means that the Web page carries little data, which greatly reduces the possibility that it is a sensitive data page. The data degree is therefore compared against a preset data degree threshold, and when the data degree of the Web page is smaller than the preset data degree threshold, the Web page is judged not to be a sensitive data page. By introducing the data degree as a measurement parameter for sensitive data pages and setting a preset data degree threshold as a uniform standard, the method can quickly identify Web pages that are not sensitive data pages. The method thereby addresses two problems of the prior art: the low detection efficiency that results from needing a large number of professionally trained workers to decide whether a Web page is a sensitive data page, and the inconsistent detection results for the same Web page that result from the workers' differing subjective criteria. Both the detection efficiency of the Web page and the reliability of the detection result are improved.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a first embodiment of a method for detecting a Web page according to the present invention;
FIG. 2 is a flowchart of a second embodiment of a method for detecting a Web page according to the present invention;
FIG. 3 is a flowchart of a third embodiment of a method for detecting a Web page according to the present invention;
FIG. 4 is a structural block diagram of a device for detecting a Web page according to an embodiment of the present invention;
FIG. 5 is a structural diagram of equipment for detecting a Web page according to an embodiment of the present invention.
Detailed Description
The core of the invention is to provide a method, a device and equipment for detecting a Web page and a computer readable storage medium, thereby improving the detection efficiency of the Web page.
In order that those skilled in the art will better understand the disclosure, the invention will be described in further detail with reference to the accompanying drawings and specific embodiments. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIG. 1, FIG. 1 is a flowchart of a first embodiment of the method for detecting a Web page provided by the present invention; the specific operation steps are as follows:
Step S101: identifying data information of a Web page to be detected, and calculating the data degree of the Web page;
In this embodiment, the table tags of the Web page may be counted and the data distribution of the Web page restored according to the nesting relationship of the table tags; the statistical result is then matched against a defined data page model to calculate the data degree of the Web page.
The table tags of the Web page include the table tag, thead tag, tbody tag, tr tag, th tag and td tag. The data size of the Web page can be counted through the nesting relationship among these six tags. The nesting relationship comprises the positional relationship of the six tags, for example whether a thead tag or a tbody tag exists inside a table tag. The thead tag and the tbody tag represent the header content and the data part of the table respectively, and the tr, th and td tags describe each data entity.
The nesting relationships of the table tags in this embodiment include common ones, such as the positions of the tr, th and td tags inside the thead and tbody tags, as well as special ones, such as a td tag that contains a table tag, or a table that has only td tags and no th tags.
According to the nesting relationship of the table tags, the data distribution of the Web page can be restored and the Web page data classified: which data are data entities, which are interference items, and which are irrelevant items. Taking one row of the table as a unit, the data volume of each row is calculated separately, and the data volume of the whole table is then calculated from the row results.
It should be noted that the data page model based on the restored data distribution is suitable for calculating the data degree of table-type data. Taking a sensitive data page of a financial reimbursement system as an example, the sensitive information includes amounts, telephone numbers, order numbers and the like, and the data degree corresponds to the data values in each table.
Step S102: judging whether the data degree is greater than or equal to a preset data degree threshold value or not;
Step S103: if the data degree is smaller than the preset data degree threshold value, judging that the Web page is not a sensitive data page.
In this embodiment, the nesting relationship among the table tags of the Web page is used to restore the distribution of data entities, irrelevant items and interference items in the Web page, so that the data situation of the page is perceived as a whole and the data degree serves as a basis for sensitivity detection. This saves a large amount of human resources and improves the detection efficiency of the Web page. Setting a preset data degree threshold as a uniform standard improves the credibility of the detection result.
Based on the foregoing embodiment, in this embodiment, if the data degree of the Web page is greater than or equal to the preset data degree threshold, the sensitivity of the Web page is calculated and evaluated in order to determine whether the Web page is a sensitive data page.
Referring to FIG. 2, FIG. 2 is a flowchart of a second embodiment of the method for detecting a Web page provided by the present invention; the specific operation steps are as follows:
Step S201: calculating the data degree of the Web page according to the table tags of the Web page to be detected and the nesting relationship of the table tags;
Step S202: judging whether the data degree is greater than or equal to a preset data degree threshold value or not;
Step S203: if the data degree is greater than or equal to the preset data degree threshold value, searching for the sensitive keywords in the Web page data according to a pre-constructed sensitive keyword dictionary library;
Before the sensitive keywords in the Web page data are searched for, the sensitive keyword dictionary library is built as follows: sensitive keywords existing in each industry are collected to form a sensitive data set, and the sensitive data set is screened and trained; a weight is assigned to each sensitive keyword according to word frequency and semantics; and the sensitive keywords whose weights are larger than a preset weight threshold are classified, after which the sensitive keyword dictionary library is constructed.
When searching for sensitive data in the Web page data to be detected according to the sensitive keyword dictionary library, the Web page data may first be divided into a plurality of data strings, each of which is matched against the sensitive keywords in the sensitive keyword dictionary library. Semantic scene filtering is then performed on the successfully matched data strings: whether a successfully matched data string conforms to the semantic scene of the corresponding sensitive keyword in the dictionary library is judged according to the position information, the length information and the symbolic characteristics of the data string. The data strings that do not conform to the semantic scene are deleted, and the remaining data strings are taken as the sensitive keywords of the Web page.
Step S204: calculating the sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
Step S205: judging whether the sensitivity is greater than or equal to a preset sensitivity threshold value or not;
Step S206: if the sensitivity is greater than or equal to the preset sensitivity threshold, judging that the Web page is a sensitive data page.
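The decision order of steps S202 to S206 can be sketched as follows. Two points are assumptions, since the source does not spell them out: the "proportion" is taken to be the number of sensitive data strings divided by the total number of data strings, and a page whose sensitivity falls below the threshold is treated as not sensitive. All names and threshold values are hypothetical.

```python
def sensitivity(num_sensitive, num_data_strings):
    """Assumed definition: share of data strings that are sensitive keywords."""
    if num_data_strings == 0:
        return 0.0
    return num_sensitive / num_data_strings

def is_sensitive_page(data_degree, num_sensitive, num_data_strings,
                      data_degree_threshold, sensitivity_threshold):
    """Second-embodiment order: data degree check first, then sensitivity check."""
    # S202: a page with too little data is not a sensitive data page.
    if data_degree < data_degree_threshold:
        return False
    # S204-S206: compare the sensitivity against its threshold.
    # Returning False below the threshold is an assumption of this sketch.
    return sensitivity(num_sensitive, num_data_strings) >= sensitivity_threshold
```

Checking the data degree first means that the keyword search, which is the more expensive step, can be skipped for pages that carry little tabular data.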
In the method for detecting a Web page provided by this embodiment, the sensitivity of the Web page is evaluated when its data degree is greater than or equal to the preset data degree threshold. The Web page data is divided into a plurality of data strings, which are matched against a pre-established sensitive keyword dictionary library; the successfully matched data strings undergo semantic scene filtering, and those that conform to the semantic scenes of the corresponding sensitive keywords are taken as the sensitive keywords of the Web page, from which the sensitivity of the Web page is calculated. Compared with the prior art, which judges whether a Web page is sensitive by keyword dictionary matching alone, this embodiment finds the sensitive keywords of the Web page more accurately, so that the calculated sensitivity is more accurate. On that basis, the embodiment uses both the sensitivity and the data degree to detect whether the Web page is a sensitive data page, which improves the accuracy of the detection result.
Based on the above embodiment, in this embodiment the sensitivity of the Web page to be detected may be evaluated first; when the sensitivity is greater than or equal to a preset sensitivity threshold, the data degree of the Web page is then evaluated to determine whether the Web page is a sensitive data page. Before the sensitivity or the data degree is calculated, pages that obviously contain no sensitive data can be excluded according to the address and page features of the Web page to be detected, which improves both detection efficiency and detection accuracy.
Referring to FIG. 3, FIG. 3 is a flowchart of a third embodiment of the method for detecting a Web page provided by the present invention; the specific operation steps are as follows:
Step S301: judging whether the probability of sensitive data existing in the Web page to be detected is zero or not;
In this embodiment, the Web page to be detected can be pre-filtered using its address and page features, and Web pages that obviously contain no sensitive data are excluded, improving detection efficiency and accuracy. For example, if the Web page is a static page or contains search engine tag features, it cannot be a sensitive data page.
When the address of the Web page points to a static page, such as an HTML page, the probability that the Web page contains sensitive data is zero, so sensitive data detection does not need to be performed on it.
When the Web page contains certain search engine tag features, such as keyword, sensitive data detection is likewise not performed on the Web page.
Step S302: if the probability of the sensitive data existing in the Web page is not zero, searching for the sensitive keywords in the Web page data to be detected according to a pre-constructed sensitive keyword dictionary library;
Step S303: filtering out the sensitive keywords which do not accord with the semantic scene, and taking the remaining sensitive keywords after filtering as the sensitive keywords in the Web page;
After matching data source extraction, keyword pattern matching and semantic scene analysis have been performed on the Web page data, the sensitivity of the Web page is calculated.
As in the above embodiment, the Web page data is divided into a plurality of data strings, which yields a plurality of matching sources. Keyword pattern matching is then performed, that is, the data strings are matched against the sensitive keyword dictionary library; semantic scene analysis is performed on the successfully matched data strings to obtain the sensitive keywords of the Web page.
Semantic scene filtering of the successfully matched data strings in the Web page includes analyzing whether the keyword position, the matched data length, specific symbols, characteristic numbers and the like conform to Chinese semantics. For example, the extracted data string "amount: 1234 yuan" satisfies the semantic scene defined for the sensitive keyword "amount", whereas a data string consisting of the word "amount" alone does not satisfy the semantic scene.
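The "amount: 1234 yuan" example can be turned into a small check that looks at keyword position, a specific symbol, a characteristic number and its length. The concrete rules below (keyword at the start of the string, a following colon, a numeric value of bounded length) are assumptions chosen to reproduce the example; the source does not define the rules at this level of detail, and the function name and the `max_value_digits` parameter are hypothetical.

```python
import re

def conforms_to_amount_scene(data_string, keyword="amount", max_value_digits=12):
    """Hypothetical semantic scene rule for a monetary-amount keyword."""
    stripped = data_string.strip()
    # Position: the keyword must open the data string.
    if not stripped.startswith(keyword):
        return False
    rest = stripped[len(keyword):].lstrip()
    # Symbol: a colon (ASCII or full-width) must follow the keyword.
    if not rest or rest[0] not in ":：":
        return False
    # Characteristic number: a numeric value must follow the colon.
    match = re.match(r"\s*(\d+(?:\.\d+)?)", rest[1:])
    if match is None:
        return False
    # Length: the matched value must not be implausibly long.
    return len(match.group(1)) <= max_value_digits
```

Under these rules `conforms_to_amount_scene("amount: 1234 yuan")` is `True`, while the bare string `"amount"` and a sentence such as `"please enter the amount"` are rejected.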
Step S304: calculating the sensitivity of the Web page according to the proportion of the sensitive keywords of the Web page in the Web page data;
Step S305: judging whether the sensitivity is greater than or equal to a preset sensitivity threshold value or not;
Step S306: if the sensitivity is greater than or equal to the preset sensitivity threshold, calculating the data degree of the Web page according to the table tags of the Web page and the nesting relationship of the table tags;
Step S307: judging whether the data degree is greater than or equal to a preset data degree threshold value or not;
Step S308: if the data degree is greater than or equal to the preset data degree threshold value, judging that the Web page is a sensitive data page.
In this embodiment, the Web page is first pre-filtered according to page features such as the address of the Web page to be detected and its page tags. After Web pages that obviously contain no sensitive data have been excluded, keyword matching and semantic scene analysis are performed on the Web page data according to the pre-established sensitive keyword dictionary library, and the sensitivity of the Web page is calculated. When the sensitivity is greater than or equal to the preset sensitivity threshold, the nesting relationship among the table tags is used to restore the distribution of data entities, irrelevant items and interference items in the Web page, so that the data situation of the page is perceived as a whole and the data volume serves as a basis for sensitivity detection. This embodiment combines the matching information obtained by Chinese semantic analysis with the information obtained by data page identification, namely the sensitivity of the Web page and the magnitude and weight of its data degree, to judge whether the Web page is a sensitive data page. This greatly saves human resources and improves both the efficiency of Web page detection and the accuracy of the detection result.
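The ordering of the third embodiment (pre-filter, then sensitivity, then data degree) can be sketched with the three computations passed in as callables, so that each later and costlier step runs only when the earlier one does not already decide the outcome. The function and parameter names are hypothetical, and treating a page below the sensitivity threshold as not sensitive is an assumption; the source only states the positive branch.

```python
def detect_page(url, html, *, probability_is_zero, compute_sensitivity,
                compute_data_degree, sensitivity_threshold, data_degree_threshold):
    """Return True if the page is judged a sensitive data page (third-embodiment order)."""
    # S301: exclude pages that obviously contain no sensitive data.
    if probability_is_zero(url, html):
        return False
    # S302-S305: keyword search, scene filtering and sensitivity check.
    if compute_sensitivity(html) < sensitivity_threshold:
        return False  # assumed outcome; the source does not describe this branch
    # S306-S308: the data degree is computed only for sufficiently sensitive pages.
    return compute_data_degree(html) >= data_degree_threshold

calls = []

def fake_sensitivity(html):
    calls.append("sensitivity")
    return 0.5

def fake_data_degree(html):
    calls.append("data_degree")
    return 20

verdict = detect_page(
    "http://example.com/list.php", "<table></table>",
    probability_is_zero=lambda url, html: False,
    compute_sensitivity=fake_sensitivity,
    compute_data_degree=fake_data_degree,
    sensitivity_threshold=0.3,
    data_degree_threshold=10,
)
```

The stub functions stand in for the real computations and record the order in which they are called.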
Referring to FIG. 4, FIG. 4 is a structural block diagram of a device for detecting a Web page according to an embodiment of the present invention; the device may include:
the identification module 100 is configured to identify data information of a Web page to be detected and calculate a data degree of the Web page;
the judging module 200 is configured to judge whether the data degree is greater than or equal to a preset data degree threshold;
the determining module 300 is configured to determine that the Web page to be detected is not a sensitive data page if the data degree is smaller than the preset data degree threshold.
The device for detecting a Web page of this embodiment is used to implement the foregoing method for detecting a Web page, so its specific implementation can be found in the method embodiments above. For example, the identification module 100, the judging module 200 and the determining module 300 implement steps S101, S102 and S103 of the foregoing method for detecting a Web page, respectively.
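As a rough illustration of how the identification module 100, the judging module 200 and the determining module 300 could fit together, the following sketch counts table data with Python's standard `html.parser`. The counting rule is an assumption made for this example, not the patent's definition: only non-empty `td` cells count as data, `th` header cells are ignored, and the data degree is taken to be the total over all rows. The names `TableDataCounter`, `identify`, `judge` and `determine` are hypothetical, and the input is assumed to be well-formed HTML with explicit closing tags.

```python
from html.parser import HTMLParser


class TableDataCounter(HTMLParser):
    """Counts non-empty <td> cells per <tr>, following the tag nesting."""

    def __init__(self):
        super().__init__()
        self.open_rows = []    # one cell counter per open <tr>, innermost last
        self.open_cells = []   # one text buffer per open <td>, innermost last
        self.row_volumes = []  # data volume of each finished row

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.open_rows.append(0)
        elif tag == "td" and self.open_rows:
            self.open_cells.append([])

    def handle_data(self, data):
        # Text is credited to the innermost open cell only, so a cell that
        # merely wraps a nested table is not counted a second time.
        if self.open_cells:
            self.open_cells[-1].append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self.open_cells:
            text = "".join(self.open_cells.pop()).strip()
            if text and self.open_rows:
                self.open_rows[-1] += 1
        elif tag == "tr" and self.open_rows:
            self.row_volumes.append(self.open_rows.pop())


def identify(html):
    """Identification module: compute the data degree (assumed: total data cells)."""
    counter = TableDataCounter()
    counter.feed(html)
    counter.close()
    return sum(counter.row_volumes)


def judge(data_degree, threshold):
    """Judging module: compare the data degree with the preset threshold."""
    return data_degree >= threshold


def determine(html, threshold):
    """Determining module: rule out pages whose data degree is below the threshold."""
    if not judge(identify(html), threshold):
        return "not a sensitive data page"
    return "needs further checks"


page = (
    "<table><thead><tr><th>Name</th><th>ID</th></tr></thead>"
    "<tbody><tr><td>Alice</td><td>123</td></tr>"
    "<tr><td>Bob</td><td></td></tr></tbody></table>"
)
degree = identify(page)
```

A page that reaches the threshold is only reported as needing further checks here, because the method leaves the final decision to the later sensitivity comparison.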
Referring to Fig. 5, Fig. 5 is a structural diagram of equipment for detecting a Web page according to an embodiment of the present invention. The equipment may include:
a memory 100 for storing a computer program;
a processor 200 for implementing the steps of the above-described method of detecting a Web page when executing said computer program.
A specific embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the computer program implements the steps of the method for detecting a Web page.
The embodiments are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is brief, and the relevant details can be found in the description of the method.
Those of skill in the art will further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The method, apparatus, device and computer readable storage medium for detecting Web pages provided by the present invention are described in detail above. The principles and embodiments of the present invention are explained herein using specific examples, which are presented only to assist in understanding the method and its core concepts. It should be noted that, for those skilled in the art, it is possible to make various improvements and modifications to the present invention without departing from the principle of the present invention, and those improvements and modifications also fall within the scope of the claims of the present invention.

Claims (12)

1. A method for detecting Web pages, comprising:
identifying data information of a Web page to be detected, and calculating the data degree of the Web page; the data degree is a corresponding data value in each form;
judging whether the data degree is greater than or equal to a preset data degree threshold value or not;
if the data degree is smaller than the preset data degree threshold value, judging that the Web page is not a sensitive data page;
wherein identifying the data information of the Web page to be detected and calculating the data degree of the Web page comprises:
counting the data size in the Web page according to the table tags of the Web page and the nesting relation of the table tags, and calculating the data degree according to the data size.
2. The method of claim 1, wherein after said determining whether said data degree is greater than or equal to a preset data degree threshold, the method further comprises:
if the data degree is greater than or equal to the preset data degree threshold value, searching for the sensitive keywords in the Web page data according to a pre-constructed sensitive keyword dictionary library;
calculating the sensitivity of the Web page according to the proportion of the sensitive keywords in the Web page data;
judging whether the sensitivity is greater than or equal to a preset sensitivity threshold value or not;
and if the sensitivity is greater than or equal to the preset sensitivity threshold, judging that the Web page is a sensitive data page.
3. The method of claim 2, wherein before searching for the sensitive keyword in the Web page data according to the pre-constructed sensitive keyword dictionary database, the method further comprises:
sensitive keywords of each industry are collected to form a sensitive data set, and weights are distributed to the sensitive keywords in the sensitive data set according to word frequency and semantics;
classifying the sensitive keywords with the weight larger than a preset weight threshold, constructing a sensitive keyword dictionary library, and setting a semantic scene for each sensitive keyword in the sensitive keyword dictionary library.
4. The method of claim 3, wherein the searching for the sensitive keyword in the Web page data according to the pre-constructed sensitive keyword dictionary library comprises:
segmenting the Web page data to obtain a plurality of data strings, and respectively performing keyword matching on the data strings according to a pre-constructed sensitive keyword dictionary library;
performing semantic scene filtering on the data strings successfully matched with the sensitive keywords in the sensitive keyword dictionary library;
and filtering the data strings which do not accord with the semantic scene in the successfully matched data strings, and taking the remaining filtered data strings as sensitive keywords in the Web page data.
5. The method of claim 4, wherein the filtering data strings which do not conform to semantic scenes in the successfully matched data strings, and the using the remaining filtered data strings as sensitive keywords in the Web page data comprises:
analyzing the position information, the length information and the symbolic characteristics of the successfully matched data string, and judging whether the successfully matched data string conforms to a semantic scene;
and after deleting the data strings which do not accord with the semantic scene in the successfully matched data strings, taking the rest data strings as sensitive keywords in the Web page data.
6. The method of claim 1, wherein said calculating the data degree of the Web page based on the table tags of the Web page and the nesting relation of the table tags comprises:
counting the data size of the Web page by using the nesting relation among the table tag, the thead tag, the tbody tag, the tr tag, the th tag and the td tag of the Web page, so as to calculate the data degree of the Web page.
7. The method of claim 6, wherein said counting the data size of the Web page by using the nesting relation among the table tag, the thead tag, the tbody tag, the tr tag, the th tag and the td tag of the Web page comprises:
restoring the distribution of the data entities, the interference items and the irrelevant items in the Web page by using the nesting relation among the table tag, the thead tag, the tbody tag, the tr tag, the th tag and the td tag of the Web page;
calculating the data volume of each row of the Web page table, and then counting the total data volume of the Web page table.
8. The method according to any one of claims 1 to 7, wherein before identifying the data information of the Web page to be detected and calculating the data degree of the Web page, the method further comprises:
judging the probability that the Web page contains sensitive data by using the page address and the page characteristics of the Web page to be detected.
9. The method of claim 8, wherein the judging the probability that the Web page contains sensitive data by using the page address and the page characteristics of the Web page to be detected comprises:
if the Web page is a static page, or the Web page contains the characteristics of a search engine tag, judging that the probability of sensitive data in the Web page is zero.
10. An apparatus for detecting Web pages, comprising:
the identification module is used for identifying the data information of the Web page to be detected and calculating the data degree of the Web page; the data degree is a corresponding data value in each form;
the judging module is used for judging whether the data degree is greater than or equal to a preset data degree threshold value or not;
and the determining module is used for determining that the Web page to be detected is not a sensitive data page if the data degree is smaller than the preset data degree threshold.
11. An apparatus for detecting Web pages, comprising:
a memory for storing a computer program;
a processor for implementing the steps of a method of detecting Web pages as claimed in any one of claims 1 to 9 when executing said computer program.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of a method of detecting Web pages as claimed in any one of claims 1 to 9.
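The keyword search of claims 2 to 5 (segmentation, dictionary matching, semantic scene filtering, and a sensitivity computed as a proportion) might look roughly like the sketch below. Everything specific in it is an assumption for illustration: the dictionary contents, whitespace-based segmentation (a real system for Chinese text would need a proper word segmenter), and the scene rule, which checks that the keyword appears at the start of the data string (position), that the string is short (length), and that it contains a key-value separator (symbol characteristics).

```python
import re

# Hypothetical sensitive keyword dictionary; the patent does not list its contents.
SENSITIVE_KEYWORDS = {"phone", "password", "id_card"}


def segment(text):
    """Split page data into data strings (assumed: whitespace-separated)."""
    return [s for s in re.split(r"\s+", text) if s]


def conforms_to_scene(data_string, keyword, max_len=40):
    """Assumed scene rule built from position, length and symbol features."""
    starts_with_keyword = data_string.startswith(keyword)                # position
    short_enough = len(data_string) <= max_len                           # length
    has_separator = any(sep in data_string for sep in (":", "=", "："))  # symbols
    return starts_with_keyword and short_enough and has_separator


def find_sensitive(text):
    """Return the matched data strings that survive semantic scene filtering."""
    hits = []
    for data_string in segment(text):
        for keyword in SENSITIVE_KEYWORDS:
            if keyword in data_string and conforms_to_scene(data_string, keyword):
                hits.append(data_string)
                break
    return hits


def sensitivity(text):
    """Proportion of sensitive data strings among all data strings."""
    strings = segment(text)
    if not strings:
        return 0.0
    return len(find_sensitive(text)) / len(strings)


sample = "phone:13800000000 forgot your password today password=hunter2"
hits = find_sensitive(sample)
score = sensitivity(sample)
```

In the sample, the bare word "password" matches the dictionary but is filtered out because it carries no value, which is the kind of false match the semantic scene step is meant to remove.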
CN201811142835.5A 2018-09-28 2018-09-28 Method, device and equipment for detecting Web page and computer storage medium Active CN109409091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811142835.5A CN109409091B (en) 2018-09-28 2018-09-28 Method, device and equipment for detecting Web page and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811142835.5A CN109409091B (en) 2018-09-28 2018-09-28 Method, device and equipment for detecting Web page and computer storage medium

Publications (2)

Publication Number Publication Date
CN109409091A CN109409091A (en) 2019-03-01
CN109409091B true CN109409091B (en) 2021-11-19

Family

ID=65466457

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811142835.5A Active CN109409091B (en) 2018-09-28 2018-09-28 Method, device and equipment for detecting Web page and computer storage medium

Country Status (1)

Country Link
CN (1) CN109409091B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115222567B (en) * 2022-09-20 2023-04-25 北京润尼尔网络科技有限公司 VR resource and equipment management system
CN117390657A (en) * 2023-12-12 2024-01-12 深圳竹云科技股份有限公司 Data encryption method, device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105022803A (en) * 2015-07-01 2015-11-04 广州市万隆证券咨询顾问有限公司 Method and system for extracting text content of webpage
CN106599155A (en) * 2016-12-07 2017-04-26 北京亚鸿世纪科技发展有限公司 Method and system for classifying web pages
CN106776946A (en) * 2016-12-02 2017-05-31 重庆大学 A kind of detection method of fraudulent website
CN107943954A (en) * 2017-11-24 2018-04-20 杭州安恒信息技术有限公司 Detection method, device and the electronic equipment of webpage sensitive information
CN108304584A (en) * 2018-03-06 2018-07-20 百度在线网络技术(北京)有限公司 Illegal page detection method, apparatus, intruding detection system and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9251372B1 (en) * 2015-03-20 2016-02-02 Yahoo! Inc. Secure service for receiving sensitive information through nested iFrames

Also Published As

Publication number Publication date
CN109409091A (en) 2019-03-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant