CN113378172A

CN113378172A - Method, apparatus, computer system, and medium for identifying sensitive web pages

Info

Publication number: CN113378172A
Application number: CN202010118033.1A
Authority: CN
Inventors: 李斌; 李国辉; 李凯
Original assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Current assignee: Qianxin Technology Group Co Ltd; Secworld Information Technology Beijing Co Ltd
Priority date: 2020-02-25
Filing date: 2020-02-25
Publication date: 2021-09-10
Anticipated expiration: 2040-02-25
Also published as: CN113378172B

Abstract

The present disclosure provides a method for identifying sensitive web pages, comprising: acquiring network assets in a target network space, wherein the network assets comprise various types of assets; determining web page assets from the network assets; and identifying the web pages in the web page assets to determine sensitive web pages in the web page assets. The present disclosure also provides an apparatus for identifying sensitive web pages, a computer system, a computer readable storage medium and a computer program product.

Description

Method, apparatus, computer system, and medium for identifying sensitive web pages

Technical Field

The present disclosure relates to a method for identifying a sensitive web page, an apparatus for identifying a sensitive web page, a computer system, a computer readable storage medium and a computer program product.

Background

With the continued growth of enterprise business and the rapid development of business informatization, business support platforms and management systems are becoming increasingly complex. Network assets, such as host operating systems, databases, middleware, application components, servers, storage devices, network devices, and security devices, are increasing in both number and variety, which makes asset management more difficult for administrators. If network assets are not brought within the scope of the administrator's routine maintenance, more and more vulnerabilities and non-compliant configurations can arise, creating serious hidden dangers for enterprise security and turning these assets into weak points in enterprise information security. It is therefore necessary to maintain the network assets regularly, for example by performing security detection.

However, in the course of implementing the present disclosure, the inventors found that the related art does not mine the asset attributes of network assets in sufficient depth, which increases the operation and maintenance workload and in turn degrades the timeliness of operations work.

Disclosure of Invention

One aspect of the present disclosure provides a method for identifying a sensitive web page, comprising: acquiring network assets in a target network space, wherein the network assets comprise various types of assets; determining web page assets from the network assets; and identifying the web pages in the web page assets to determine sensitive web pages in the web page assets.

Optionally, identifying the web pages in the web page assets to determine sensitive web pages in the web page assets includes: for each web page in the web page assets, loading, using a headless browser, the static content pointed to by the uniform resource address corresponding to the current web page to be identified; parsing the page structure of the static content to determine whether the current web page has a script to be executed; when a script to be executed exists in the current web page, executing the script and acquiring dynamic content; and identifying the static content and the dynamic content of the current web page to determine whether the current web page is a sensitive web page.

Optionally, identifying the static content and the dynamic content of the current web page to determine whether the current web page is a sensitive web page includes: performing sensitivity feature matching between the static and dynamic content of the current web page and the sensitive words in a sensitive word set to obtain a matching result; performing sensitivity scoring on the current web page according to the matching result to obtain a scoring result; and determining whether the current web page is a sensitive web page according to the scoring result.

Optionally, the sensitive word set includes at least a first sensitive word set and a second sensitive word set, the first sensitive word set is assigned a first weight value, the second sensitive word set is assigned a second weight value, the sensitive words in the first sensitive word set are different from those in the second sensitive word set, and the first and second sensitive word sets are respectively used for matching the content of different parts of a web page, wherein:

performing sensitivity feature matching between the static and dynamic content of the current web page and the sensitive words in the sensitive word set to obtain a matching result includes: dividing the page content of the current web page into at least a first part and a second part according to page tags, wherein the page content comprises the static content and the dynamic content; performing sensitivity feature matching between the first part and the sensitive words in the first sensitive word set to obtain a first matching result; and performing sensitivity feature matching between the second part and the sensitive words in the second sensitive word set to obtain a second matching result; and

performing sensitivity scoring on the current web page according to the matching result to obtain a scoring result includes: performing sensitivity scoring on the current web page according to the first matching result, the second matching result, the first weight value, and the second weight value to obtain the scoring result.

Optionally, the method further includes: determining the sensitive content matched with the sensitive words in the current webpage according to the matching result; acquiring the position of the sensitive content in the current webpage; and sending the scoring result of the current webpage and the position of the sensitive content in the current webpage to a caller.

Optionally, the determining the web page asset from the network assets comprises: identifying a protocol type for each of said network assets; and determining an asset for which the protocol type belongs to the hypertext transfer protocol type as the web page asset.

Another aspect of the present disclosure provides an apparatus for identifying a sensitive web page, comprising: an acquisition module, configured to acquire network assets in a target network space, wherein the network assets comprise various types of assets; a determining module, configured to determine web page assets from the network assets; and an identification module, configured to identify the web pages in the web page assets to determine sensitive web pages in the web page assets.

Optionally, the identification module includes: a loading unit, configured to load, for each web page in the web page assets and using a headless browser, the static content pointed to by the uniform resource address corresponding to the current web page to be identified; a parsing unit, configured to parse the page structure of the static content to determine whether the current web page has a script to be executed; an execution unit, configured to execute the script to be executed and acquire dynamic content when the script to be executed exists in the current web page; and a first identification unit, configured to identify the static content and the dynamic content of the current web page to determine whether the current web page is a sensitive web page.

Optionally, the first identification unit is configured to: perform sensitivity feature matching between the static and dynamic content of the current web page and the sensitive words in a sensitive word set to obtain a matching result; perform sensitivity scoring on the current web page according to the matching result to obtain a scoring result; and determine whether the current web page is a sensitive web page according to the scoring result.

Performing sensitivity feature matching between the static and dynamic content of the current web page and the sensitive words in the sensitive word set to obtain a matching result includes: dividing the page content of the current web page into at least a first part and a second part according to page tags, wherein the page content comprises the static content and the dynamic content; performing sensitivity feature matching between the first part and the sensitive words in the first sensitive word set to obtain a first matching result; and performing sensitivity feature matching between the second part and the sensitive words in the second sensitive word set to obtain a second matching result.

Optionally, the first identification unit is further configured to: determine, according to the matching result, the sensitive content in the current web page that matches the sensitive words; acquire the position of the sensitive content in the current web page; and send the scoring result of the current web page and the position of the sensitive content in the current web page to a caller.

Optionally, the determining module includes: a second identification unit for identifying the protocol type of each type of assets in the network assets; and the determining unit is used for determining the assets of which the protocol type belongs to the hypertext transfer protocol type as the webpage assets.

Another aspect of the present disclosure provides a computer system comprising: one or more processors; a readable storage medium for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.

Another aspect of the disclosure provides a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to implement the method as described above.

Another aspect of the disclosure provides a computer program product comprising executable instructions that, when executed by a processor, cause the processor to implement the method as described above.

Drawings

For a more complete understanding of the present disclosure and the advantages thereof, reference is now made to the following descriptions taken in conjunction with the accompanying drawings, in which:

FIG. 1 schematically illustrates an application scenario of a method and apparatus for identifying a sensitive web page according to an embodiment of the present disclosure;

FIG. 2 schematically illustrates a flow diagram of a method for identifying sensitive web pages in accordance with an embodiment of the present disclosure;

FIG. 3 schematically illustrates a flow diagram for identifying web pages in a web page asset to determine sensitive web pages in the web page asset according to an embodiment of the disclosure;

FIG. 4 schematically illustrates a flow chart for identifying static and dynamic content of a current web page to determine whether the current web page is a sensitive web page according to an embodiment of the disclosure;

FIG. 5 schematically shows a schematic diagram of a sensitive page according to an embodiment of the disclosure;

FIG. 6 schematically illustrates a block diagram of an apparatus for identifying sensitive web pages in accordance with an embodiment of the present disclosure; and

FIG. 7 schematically illustrates a block diagram of a computer system suitable for implementing the above-described method, according to an embodiment of the present disclosure.

Detailed Description

Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.

All terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.

Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.). Where a convention analogous to "at least one of A, B or C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B or C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, C together, etc.).

Some block diagrams and/or flow diagrams are shown in the figures. It will be understood that some blocks of the block diagrams and/or flowchart illustrations, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the instructions, which execute via the processor, create means for implementing the functions/acts specified in the block diagrams and/or flowchart block or blocks. The techniques of this disclosure may be implemented in hardware and/or software (including firmware, microcode, etc.). In addition, the techniques of this disclosure may take the form of a computer program product on a computer-readable storage medium having instructions stored thereon for use by or in connection with an instruction execution system.

Embodiments of the present disclosure provide a method for identifying a sensitive web page, the method including obtaining network assets including multiple types of assets in a target network space, and then determining web page assets from the network assets; and identifying the web pages in the web page assets to determine the sensitive web pages in the web page assets.

FIG. 1 schematically illustrates an application scenario of a method and apparatus for identifying a sensitive web page according to an embodiment of the present disclosure. It should be noted that FIG. 1 is only an example of a scenario in which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, but does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.

As shown in FIG. 1, an electronic device 101 may obtain network assets in a network space 102, where the network assets in the network space 102 include, for example and without limitation, host operating systems, databases, middleware, application components, web pages, servers, storage devices, network devices, security devices, and the like.

According to embodiments of the present disclosure, the electronic device 101 may be a hardware product that provides network asset mapping and management. Network asset mapping (also referred to as cyberspace mapping) is a method for constructing a network asset map and managing network assets by using some technical methods to detect node distribution and network relationship indexes in a certain internet space or a user private network space.

According to the embodiment of the disclosure, after the electronic device 101 is deployed and networked with the network space 102, the network assets in the network space 102 can be obtained through a single comprehensive scan. For example, scanning by the electronic device 101 finds 10000 network assets in the network space 102, and filtering these 10000 network assets yields 3000 web assets (i.e., web page assets). By identifying the web assets, 200 assets with apparent sensitivity are discovered, which may include, for example, mail system pages, pages with file upload entries, server error information, and the like, and a report evaluating the degree of sensitivity may be generated for each of these assets.

According to the embodiment of the disclosure, assets with higher sensitivity present higher security risks; security operations can be carried out on such assets and high-risk sensitive items can be remediated, thereby eliminating potential security hazards.

FIG. 2 schematically shows a flow diagram of a method for identifying sensitive web pages in accordance with an embodiment of the present disclosure.

As shown in FIG. 2, the method includes operations S210 to S230.

In operation S210, network assets in a target network space are obtained, wherein the network assets include a plurality of types of assets.

According to an embodiment of the present disclosure, for example, the live network assets in the target network space may be discovered by scanning IP address and port combinations with a scanning tool (e.g., nmap or zmapv6).

According to the embodiments of the present disclosure, the target network space may be, for example, a network space of an enterprise, a network space of a scientific research institution, or the like, or a network space of an organization such as a government, or the like. The types of assets may be, for example, host assets, database assets, network device assets, security device assets, operating system assets, mail system assets, web page assets, and the like.

In operation S220, web page assets are determined from the network assets.

According to the embodiment of the disclosure, in the obtained network assets, the protocol type of each type of assets in the network assets can be identified, and the assets of which the protocol types belong to the hypertext transfer protocol type are determined as the webpage assets.

According to embodiments of the present disclosure, because web page assets are typically communicated using the HTTP protocol (i.e., hypertext transfer protocol), web page assets can be filtered out of network assets that include multiple types of assets by identifying the protocol type.
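
The filtering step can be sketched as follows. This is a minimal illustration only: it assumes each scanned asset is a record whose `protocol` field has already been filled in by the scanner. The `Asset` class, its field names, and the decision to count `https` as a hypertext transfer protocol type are assumptions made for the example and are not taken from the disclosure.

```python
from dataclasses import dataclass

# Protocol labels treated as "hypertext transfer protocol type" (an assumption).
HTTP_PROTOCOLS = {"http", "https"}


@dataclass(frozen=True)
class Asset:
    """Hypothetical record for one scanned network asset."""
    ip: str
    port: int
    protocol: str  # hypothetical label produced by the scanner


def select_web_page_assets(assets):
    """Return only the assets whose protocol type is an HTTP type."""
    return [a for a in assets if a.protocol.lower() in HTTP_PROTOCOLS]


assets = [
    Asset("10.0.0.1", 22, "ssh"),
    Asset("10.0.0.2", 80, "http"),
    Asset("10.0.0.3", 3306, "mysql"),
    Asset("10.0.0.4", 443, "HTTPS"),
]
web_assets = select_web_page_assets(assets)
```

In this example only the assets on ports 80 and 443 remain in `web_assets`; the SSH and MySQL assets are filtered out.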

In operation S230, web pages in the web page asset are identified to determine sensitive web pages in the web page asset.

According to the embodiments of the present disclosure, web page assets are determined from network assets comprising multiple types of assets, and the web pages in the web page assets are identified so as to determine sensitive web pages. Security operations can then be carried out on sensitive web pages that present higher security risks, and high-risk sensitive items can be remediated, thereby eliminating potential security hazards. The present disclosure refines the asset attributes of network assets, identifies web page assets directly, and can efficiently identify security vulnerabilities or non-compliant configurations affecting web pages.

Through the embodiment of the disclosure, because sensitivity identification is performed directly on the web page assets, the problem in the related art, namely that insufficiently deep mining of the asset attributes of network assets increases the operation and maintenance workload and degrades the timeliness of operations work, can be solved, and security vulnerabilities or non-compliant configurations in the network assets can be determined accurately.

The method shown in FIG. 2 is further described with reference to FIGS. 3-5 in conjunction with specific embodiments.

At present, the content of a large number of web page assets includes dynamic content. General-purpose crawler technology can usually obtain only the static content of a page through its URL and cannot execute the JavaScript in the page to load dynamically generated content, so a large amount of information is lost and the sensitivity of the web page assets cannot be identified effectively. Therefore, to improve the accuracy of sensitivity identification for web page assets, all the content of the web page needs to be identified.

FIG. 3 schematically illustrates a flow diagram for identifying web pages in a web page asset to determine sensitive web pages in the web page asset according to an embodiment of the disclosure.

As shown in FIG. 3, identifying web pages in the web page asset to determine sensitive web pages in the web page asset includes operations S310-S340.

In operation S310, for each web page in the web page asset, static content pointed to by a uniform resource address corresponding to a current web page to be identified is loaded using a headless browser.

According to the embodiment of the disclosure, a headless browser is a browser without a graphical user interface and can be used for tasks such as web crawling and automated testing. Each page may have a corresponding uniform resource address (URL). The static content of the current web page to be identified can be acquired by loading the page pointed to by the URL through the headless browser.

In operation S320, the page structure of the static content is parsed to determine whether a script to be executed exists in the current web page.

According to the embodiment of the disclosure, for example, the page structure of the static content is an HTML structure, and the HTML structure can be analyzed to determine whether a JavaScript script needs to be loaded and executed.
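
One way to picture this check is the sketch below, which walks the static HTML with Python's standard `html.parser` module and records every `<script>` tag. The class and function names are hypothetical, and in practice a headless browser would normally discover and run scripts itself, so this is only an illustration of the decision being made.

```python
from html.parser import HTMLParser


class ScriptDetector(HTMLParser):
    """Collects the <script> tags found while parsing static HTML."""

    def __init__(self):
        super().__init__()
        # One entry per <script>: its src URL, or None for an inline script.
        self.scripts = []

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.scripts.append(dict(attrs).get("src"))


def has_script_to_execute(static_html):
    """Return True when the static HTML references or embeds a script."""
    detector = ScriptDetector()
    detector.feed(static_html)
    detector.close()
    return len(detector.scripts) > 0


with_script = '<html><body><script src="app.js"></script></body></html>'
without_script = "<html><body><p>static only</p></body></html>"
```

Here `has_script_to_execute(with_script)` is true and `has_script_to_execute(without_script)` is false, which corresponds to the two branches of the flow: execute the scripts to obtain dynamic content, or identify the static content directly.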

In operation S330, in a case that it is determined that the script to be executed exists in the current webpage, the script to be executed is executed, and the dynamic content is acquired.

According to the embodiment of the disclosure, in the process of executing the script, the script can be sequentially executed according to the introduction sequence of the script in the page, and all dynamic contents are acquired.

In operation S340, static content and dynamic content of the current web page are identified to determine whether the current web page is a sensitive web page.

According to the embodiment of the disclosure, for example, the static content and the dynamic content of the current webpage and the sensitive words in the sensitive word set can be subjected to sensitivity feature matching to obtain a matching result, then the current webpage is subjected to sensitivity scoring according to the matching result to obtain a scoring result, and finally whether the current webpage is the sensitive webpage or not is determined according to the scoring result.

According to the embodiment of the disclosure, the sensitive word set may include a plurality of sensitive words, and the assigned weight of each sensitive word may be the same or different.

According to the embodiment of the disclosure, for example, if the static content of the current webpage includes the sensitive content identical to the sensitive words in the sensitive word set, sensitivity scoring can be performed according to the number of the sensitive contents and the weights of the corresponding sensitive words, and finally, whether the current webpage is the sensitive webpage or not is determined according to the scoring result.
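
As a rough sketch of count-and-weight scoring, the function below multiplies the number of occurrences of each sensitive word by that word's weight and sums the results. The function name, the case-insensitive substring matching, and the example weights are all assumptions chosen for illustration.

```python
def score_content(content, word_weights):
    """Sum weight * occurrence count over every sensitive word in content.

    Matching is a plain case-insensitive substring count (an assumption).
    """
    text = content.lower()
    score = 0
    for word, weight in word_weights.items():
        score += text.count(word.lower()) * weight
    return score


# Hypothetical per-word weights.
word_weights = {"password": 3, "nginx": 1}
page = "Login: enter password. Forgot password? Powered by nginx."
page_score = score_content(page, word_weights)  # 2 * 3 + 1 * 1 = 7
```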

According to the embodiment of the disclosure, if it is determined that the script to be executed does not exist in the current webpage, the static content of the current webpage is directly identified.

According to an embodiment of the present disclosure, the sensitive word set may include at least a first sensitive word set and a second sensitive word set, the first sensitive word set being assigned with a first weight value, the second sensitive word set being assigned with a second weight value. The weights of all the sensitive words in the first sensitive word set may be first weight values, and the weights of all the sensitive words in the second sensitive word set may be second weight values. The sensitive words in the first sensitive word set are different from the sensitive words in the second sensitive word set, and the first sensitive word set and the second sensitive word set are respectively used for matching the contents of different parts in the webpage. According to the embodiment of the disclosure, the corresponding sensitive word set can be called to match the content of the corresponding part in the webpage according to the page tag of the webpage.

It should be noted that the number of sensitive word sets is not limited in the present disclosure and may be determined according to actual requirements. For example, in practice the sensitive word set may further include a third sensitive word set, and the weight of every sensitive word in the third sensitive word set may be a third weight value. According to an embodiment of the present disclosure, the first weight value, the second weight value, and the third weight value may be set according to the past experience of an administrator.

According to the embodiment of the disclosure, for example, three sensitive word sets may be maintained for a web page asset, targeting the URL, the title, and the body of a page respectively, that is, a URL sensitive word set, a title sensitive word set, and a body sensitive word set, and different sensitive word sets may be assigned different weights. When sensitivity feature matching is performed on the static content and the dynamic content of the current web page, the page content of the current web page can be matched against the corresponding URL, title, and body sensitive word sets.

Through the embodiment of the disclosure, sensitive web page assets can be given priority in monitoring and management, and the network asset mapping capability is enhanced.

FIG. 4 schematically illustrates a flow chart for identifying static and dynamic content of a current web page to determine whether the current web page is a sensitive web page according to an embodiment of the disclosure.

As shown in FIG. 4, identifying static content and dynamic content of the current web page to determine whether the current web page is a sensitive web page includes operations S410 to S450.

In operation S410, page content of a current webpage is divided into at least a first part and a second part according to a page tag, wherein the page content includes static content and dynamic content.

According to an embodiment of the present disclosure, for example, the page content of the current web page may be divided into two parts, title and body, according to the page tags <title> and <body>. The first sensitive word set may be a title sensitive word set and the second sensitive word set may be a body sensitive word set.
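
A minimal sketch of splitting page content by tag, using Python's standard `html.parser`, might look like this. The `PartSplitter` class and `split_page` function are hypothetical names; the sketch also keeps any inline script text that appears inside the body, which a real implementation would probably want to exclude.

```python
from html.parser import HTMLParser


class PartSplitter(HTMLParser):
    """Collects the text that appears inside <title> and inside <body>."""

    def __init__(self):
        super().__init__()
        self._current = None  # name of the part currently being read
        self.parts = {"title": [], "body": []}

    def handle_starttag(self, tag, attrs):
        if tag in self.parts:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current is not None:
            self.parts[self._current].append(data)


def split_page(html):
    """Return {'title': ..., 'body': ...} with whitespace normalized."""
    splitter = PartSplitter()
    splitter.feed(html)
    splitter.close()
    return {
        name: " ".join(" ".join(chunks).split())
        for name, chunks in splitter.parts.items()
    }


html = (
    "<html><head><title>Admin Login</title></head>"
    "<body><h1>Welcome</h1><p>Enter password</p></body></html>"
)
parts = split_page(html)
```

Each part can then be matched against its own sensitive word set, for example `parts["title"]` against the title set and `parts["body"]` against the body set.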

According to an embodiment of the present disclosure, for example, a sensitive word set for a page body is exemplified as follows: 'logic', 'apache', 'tomcat', 'nginx', 'h3c', 'manage', 'password', 'username', 'management', 'background', 'platform', 'username', 'password', 'device', 'outlook windows'.

In operation S420, sensitivity feature matching is performed on the first portion and the sensitive words in the first sensitive word set to obtain a first matching result.

According to an embodiment of the present disclosure, for example, sensitivity feature matching is performed between the content of the title part and the sensitive words in the title sensitive word set. According to the embodiment of the disclosure, if content matching a sensitive word exists, the specific sensitive content in the current web page that matches the sensitive word can be determined from the matching result, and the position of that sensitive content in the current web page can then be determined.

In operation S430, sensitivity feature matching is performed on the second portion and the sensitive words in the second sensitive word set, so as to obtain a second matching result.

According to an embodiment of the present disclosure, for example, the content in the body part is sensitivity feature matched with the sensitive words in the body sensitive word set.

In operation S440, sensitivity scoring is performed on the current webpage according to the first matching result, the second matching result, the first weight value, and the second weight value, so as to obtain a scoring result.

According to an embodiment of the present disclosure, for example, the first weight value is 8 points and the second weight value is 2 points. The first matching result shows that the first part of the page content contains 1 item matching a sensitive word in the first sensitive word set, and the second matching result shows that the second part contains 3 items matching sensitive words in the second sensitive word set. When scoring the sensitivity of the current web page, each weight value is multiplied by the corresponding number of matches and the products are summed: 8 × 1 + 2 × 3 = 14. This score of 14 is used as the scoring result of the current web page.
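
The arithmetic in this example can be reproduced with a short sketch. The function name and the dictionary layout are assumptions; the weights and match counts are the ones given in the example, with the first part taken to be the title and the second part the body.

```python
def sensitivity_score(match_counts, weights):
    """Weighted sum: each part's match count times that part's weight."""
    return sum(weights[part] * count for part, count in match_counts.items())


weights = {"title": 8, "body": 2}        # first and second weight values
match_counts = {"title": 1, "body": 3}   # first and second matching results
score = sensitivity_score(match_counts, weights)  # 8 * 1 + 2 * 3 = 14
```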

In operation S450, it is determined whether the current web page is a sensitive web page according to the scoring result.

According to an embodiment of the present disclosure, an evaluation threshold may be set in advance, for example 20. If the scoring result of the web page is greater than or equal to the evaluation threshold, the current web page is determined to be a sensitive web page. If the scoring result is less than the evaluation threshold, the current web page is determined not to be a sensitive web page.
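
The threshold comparison amounts to a single test, sketched below with the example threshold of 20. The constant and function names are hypothetical.

```python
EVALUATION_THRESHOLD = 20  # example threshold value from the text


def is_sensitive(score, threshold=EVALUATION_THRESHOLD):
    """A page is sensitive when its score reaches or exceeds the threshold."""
    return score >= threshold
```

With this threshold, a page scoring 14 is not classified as sensitive, while a page scoring exactly 20 is.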

FIG. 5 schematically shows a schematic diagram of a sensitive page according to an embodiment of the disclosure.

As shown in fig. 5, the page is a server 500 page. Through the multi-mode matching algorithm, the fact that an 'apache' sensitive word exists in the page body is found, the page state code is 500(500 represents that a server error exists), and the page is a sensitive page due to the fact that the call stack information of the server side is exposed and potential safety hazards exist.

According to the embodiment of the disclosure, if the weight value of body sensitive words is 1, the weight value of URL sensitive words is 8, and the weight value of title sensitive words is 10, then, since a sensitive word is found only in the body of the page and none is found in the URL or the title, the page accumulates a score of 1 point. According to embodiments of the present disclosure, the web page score and the positions of the sensitive words may be returned to the caller.

According to the embodiment of the disclosure, the position of the sensitive content in the current webpage can be obtained, and the scoring result of the current webpage and the position of the sensitive content in the current webpage are sent to the caller. The caller can therefore quickly locate the sensitive content in the webpage, which improves the timeliness of operation and maintenance work and reduces the extra workload of checking for sensitive content during operation and maintenance.
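One way the score and the positions might be returned together is sketched below. The weights for the URL, title, and body (8, 10, and 1) come from the example above; everything else is an assumption for illustration, including the `Hit` record, the `scan_page` function, the sample word lists, and the use of a compiled regular-expression alternation in place of whatever matcher an implementation actually uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Hit:
    part: str   # which part of the page the match was found in
    word: str   # the matched sensitive word, lowercased
    start: int  # character offset of the match within that part
    end: int    # offset one past the last matched character


# Weights from the example in the text; the key names are hypothetical.
PART_WEIGHTS = {"url": 8, "title": 10, "body": 1}


def scan_page(parts, words_by_part):
    """Return the accumulated score and the position of every match."""
    hits, score = [], 0
    for part, text in parts.items():
        words = words_by_part.get(part, [])
        if not words:
            continue  # an empty alternation would match everywhere
        pattern = re.compile("|".join(re.escape(w) for w in words), re.IGNORECASE)
        for m in pattern.finditer(text):
            hits.append(Hit(part, m.group(0).lower(), m.start(), m.end()))
            score += PART_WEIGHTS[part]
    return {"score": score, "hits": hits}


# Hypothetical page: a sensitive word appears only in the body.
result = scan_page(
    {
        "url": "http://example.test/error",
        "title": "Internal Server Error",
        "body": "Apache Tomcat/8.5 - Error report",
    },
    {"url": ["admin"], "title": ["login"], "body": ["apache"]},
)
print(result["score"])  # 1
print(result["hits"])   # [Hit(part='body', word='apache', start=0, end=6)]
```

Returning the offsets alongside the score lets the caller jump directly to the matched text instead of re-scanning the page.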

FIG. 6 schematically illustrates a block diagram of an apparatus for identifying a sensitive web page according to an embodiment of the present disclosure.

As shown in FIG. 6, an apparatus 600 for identifying a sensitive web page includes: an obtaining module 610, a determining module 620, and an identification module 630.

The obtaining module 610 is configured to obtain a network asset in a target network space, where the network asset includes a plurality of types of assets.

The determining module 620 is used to determine web page assets from the network assets.

The identification module 630 is configured to identify a web page in the web page asset to determine a sensitive web page in the web page asset.

According to an embodiment of the present disclosure, the identification module 630 includes a loading unit, a parsing unit, an execution unit, and a first identification unit.

The loading unit is used for, for each webpage in the webpage assets, loading with a headless browser the static content pointed to by the uniform resource address corresponding to the current webpage to be identified.

The parsing unit is used for parsing the page structure of the static content to determine whether the current webpage has a script to be executed.

The execution unit is used for executing the script to be executed and acquiring the dynamic content under the condition that the script to be executed exists in the current webpage.

The first identification unit is used for identifying the static content and the dynamic content of the current webpage so as to determine whether the current webpage is a sensitive webpage.
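The check performed by the parsing unit can be illustrated with Python's standard `html.parser` module. This is a sketch under assumptions: it treats any `<script>` element in the static content as a script to be executed, which the disclosure does not specify, and the names `ScriptFinder` and `has_script_to_execute` are hypothetical. Loading the page and executing the script would be done by a headless browser and are not shown here.

```python
from html.parser import HTMLParser


class ScriptFinder(HTMLParser):
    """Collect every <script> element found in static HTML."""

    def __init__(self):
        super().__init__()
        self.scripts = []  # one dict per script: external src and inline code
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._in_script = True
            self.scripts.append({"src": dict(attrs).get("src"), "code": ""})

    def handle_data(self, data):
        if self._in_script:
            self.scripts[-1]["code"] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False


def has_script_to_execute(static_html):
    """Return True when the static content contains at least one script."""
    finder = ScriptFinder()
    finder.feed(static_html)
    finder.close()
    return len(finder.scripts) > 0


page = "<html><body><script>document.title = 'x'</script></body></html>"
print(has_script_to_execute(page))                          # True
print(has_script_to_execute("<html><body>hi</body></html>"))  # False
```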

According to the embodiment of the disclosure, the first identification unit is used for: performing sensitivity characteristic matching between the static content and dynamic content of the current webpage and the sensitive words in a sensitive word set to obtain a matching result; performing sensitivity scoring on the current webpage according to the matching result to obtain a scoring result; and determining whether the current webpage is a sensitive webpage according to the scoring result.

According to the embodiment of the disclosure, the sensitive word set at least comprises a first sensitive word set and a second sensitive word set, the first sensitive word set is assigned with a first weight value, the second sensitive word set is assigned with a second weight value, the sensitive words in the first sensitive word set are different from the sensitive words in the second sensitive word set, and the first sensitive word set and the second sensitive word set are respectively used for matching the contents of different parts in the webpage.

Performing sensitivity characteristic matching between the static content and dynamic content of the current webpage and the sensitive words in the sensitive word set to obtain a matching result includes: dividing the page content of the current webpage into at least a first part and a second part according to a page tag, wherein the page content comprises the static content and the dynamic content; performing sensitivity characteristic matching between the first part and the sensitive words in the first sensitive word set to obtain a first matching result; and performing sensitivity characteristic matching between the second part and the sensitive words in the second sensitive word set to obtain a second matching result.

Performing sensitivity scoring on the current webpage according to the matching result to obtain a scoring result includes: performing sensitivity scoring on the current webpage according to the first matching result, the second matching result, the first weight value, and the second weight value to obtain the scoring result.
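The division by page tag and the per-part matching can be sketched as follows. The choice of `<title>` as the first part and `<body>` as the second part is an assumption made for illustration, as are the names `PartSplitter` and `match_parts` and the sample word sets; the disclosure only requires that the content be divided into at least two parts according to a page tag.

```python
from html.parser import HTMLParser


class PartSplitter(HTMLParser):
    """Split page text into parts keyed by the enclosing page tag."""

    def __init__(self):
        super().__init__()
        self.parts = {"title": [], "body": []}
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in self.parts:
            self._current = tag

    def handle_endtag(self, tag):
        if tag == self._current:
            self._current = None

    def handle_data(self, data):
        if self._current:
            self.parts[self._current].append(data)


def match_parts(html, first_words, second_words):
    """Match the title against the first set and the body against the second."""
    splitter = PartSplitter()
    splitter.feed(html)
    splitter.close()
    # Join text fragments with spaces so words from adjacent tags do not fuse.
    title = " ".join(splitter.parts["title"]).lower()
    body = " ".join(splitter.parts["body"]).lower()
    first_result = [w for w in first_words if w in title]
    second_result = [w for w in second_words if w in body]
    return first_result, second_result


html = (
    "<html><head><title>Admin Login</title></head>"
    "<body><p>Apache error</p><p>stack trace</p></body></html>"
)
print(match_parts(html, ["login"], ["apache", "stack trace", "password"]))
# (['login'], ['apache', 'stack trace'])
```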

According to the embodiment of the disclosure, the first identification unit is further configured to determine the sensitive content matched with the sensitive word in the current webpage according to the matching result; acquiring the position of sensitive content in a current webpage; and sending the scoring result of the current webpage and the position of the sensitive content in the current webpage to a caller.

According to an embodiment of the present disclosure, the determining module 620 includes a second identifying unit and a determining unit.

The second identification unit is used for identifying the protocol type of each type of assets in the network assets.

The determining unit is used for determining the assets of which the protocol type belongs to the hypertext transfer protocol type as the web page assets.
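The filtering performed by the determining unit might look like the following sketch. The `Asset` record, its fields, and the sample assets are hypothetical, and treating both `http` and `https` as belonging to the hypertext transfer protocol type is an assumption; how the protocol type of each asset is identified is outside the scope of this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Asset:
    address: str   # host and port of the discovered asset
    protocol: str  # protocol type identified for the asset


# Assumption: HTTPS is counted as a hypertext transfer protocol type.
HTTP_PROTOCOLS = {"http", "https"}


def select_web_page_assets(assets):
    """Keep only the assets whose protocol type is HTTP or HTTPS."""
    return [a for a in assets if a.protocol.lower() in HTTP_PROTOCOLS]


assets = [
    Asset("10.0.0.5:22", "ssh"),
    Asset("10.0.0.5:80", "HTTP"),
    Asset("10.0.0.7:3306", "mysql"),
    Asset("10.0.0.9:443", "https"),
]
print([a.address for a in select_web_page_assets(assets)])
# ['10.0.0.5:80', '10.0.0.9:443']
```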

Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.

For example, any of the obtaining module 610, the determining module 620, and the identifying module 630 may be combined and implemented in one module, or any one of them may be split into a plurality of modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the present disclosure, at least one of the obtaining module 610, the determining module 620, and the identifying module 630 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware by any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of or a suitable combination of software, hardware, and firmware. Alternatively, at least one of the obtaining module 610, the determining module 620 and the identifying module 630 may be at least partially implemented as a computer program module, which when executed, may perform a corresponding function.

There is also provided, in accordance with an embodiment of the present disclosure, a computer system, including: one or more processors; and a readable storage medium storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the method for identifying sensitive web pages as described above.

There is also provided, in accordance with an embodiment of the present disclosure, a computer program product including executable instructions that, when executed by a processor, cause the processor to implement the method for identifying sensitive web pages as described above.

FIG. 7 schematically illustrates a block diagram of a computer system suitable for implementing the above-described method according to an embodiment of the present disclosure. The computer system illustrated in FIG. 7 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.

As shown in fig. 7, computer system 700 includes a processor 710 and a readable storage medium 720. The computer system 700 may perform a method according to an embodiment of the disclosure.

In particular, processor 710 may comprise, for example, a general purpose microprocessor, an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), and/or the like. The processor 710 may also include on-board memory for caching purposes. Processor 710 may be a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present disclosure.

Readable storage medium 720, for example, may be a non-volatile computer-readable storage medium, specific examples including, but not limited to: magnetic storage devices, such as magnetic tape or Hard Disk Drives (HDDs); optical storage devices, such as compact disks (CD-ROMs); a memory, such as a Random Access Memory (RAM) or a flash memory; and so on.

The readable storage medium 720 may include a computer program 721, which computer program 721 may include code/computer-executable instructions that, when executed by the processor 710, cause the processor 710 to perform a method according to an embodiment of the disclosure, or any variation thereof.

The computer program 721 may be configured with, for example, computer program code comprising computer program modules. For example, in an example embodiment, code in computer program 721 may include one or more program modules, for example, module 721A, module 721B, …. It should be noted that the division and number of modules are not fixed, and those skilled in the art may use suitable program modules or program module combinations according to actual situations, so that the processor 710 may execute the method according to the embodiment of the present disclosure or any variation thereof when the program modules are executed by the processor 710.

According to an embodiment of the present disclosure, at least one of the obtaining module 610, the determining module 620 and the identification module 630 may be implemented as a computer program module described with reference to FIG. 7, which, when executed by the processor 710, may implement the respective operations described above.

The present disclosure also provides a readable storage medium, which may be contained in the device/apparatus/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The readable storage medium carries one or more programs which, when executed, implement a method according to an embodiment of the disclosure.

According to embodiments of the present disclosure, the readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined and/or associated in various ways, even if such combinations or associations are not expressly recited in the present disclosure. In particular, such combinations and/or associations may be made without departing from the spirit or teaching of the present disclosure. All such combinations and/or associations are within the scope of the present disclosure.

While the disclosure has been shown and described with reference to certain exemplary embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the disclosure as defined by the appended claims and their equivalents. Accordingly, the scope of the present disclosure should not be limited to the above-described embodiments, but should be defined not only by the appended claims, but also by equivalents thereof.

Claims

1. A method for identifying sensitive web pages, comprising:

acquiring network assets in a target network space, wherein the network assets comprise a plurality of types of assets;

determining web page assets from the network assets; and

identifying web pages in the web page assets to determine sensitive web pages in the web page assets.

2. The method of claim 1, wherein identifying web pages in the web page assets to determine sensitive web pages in the web page assets comprises:

for each webpage in the webpage assets, loading, by using a headless browser, static content pointed to by a uniform resource address corresponding to the current webpage to be identified;

analyzing the page structure of the static content to determine whether a script to be executed exists in the current webpage;

executing the script to be executed and acquiring dynamic content in the case that the script to be executed exists in the current webpage; and

identifying the static content and the dynamic content of the current webpage to determine whether the current webpage is a sensitive webpage.

3. The method of claim 2, wherein identifying static and dynamic content of the current web page to determine whether the current web page is a sensitive web page comprises:

carrying out sensitivity characteristic matching on the static content and the dynamic content of the current webpage and the sensitive words in the sensitive word set to obtain a matching result;

performing sensitivity scoring on the current webpage according to the matching result to obtain a scoring result; and

determining whether the current webpage is a sensitive webpage according to the scoring result.

4. The method of claim 3, wherein the set of sensitive words comprises at least a first set of sensitive words and a second set of sensitive words, the first set of sensitive words being assigned a first weight value, the second set of sensitive words being assigned a second weight value, the sensitive words in the first set of sensitive words and the sensitive words in the second set of sensitive words being different, the first set of sensitive words and the second set of sensitive words being used to match content of different portions of a web page, respectively, wherein:

carrying out sensitivity characteristic matching on the static content and the dynamic content of the current webpage and the sensitive words in the sensitive word set to obtain a matching result comprises:

dividing the page content of the current webpage into at least a first part and a second part according to a page tag, wherein the page content comprises the static content and the dynamic content;

carrying out sensitivity characteristic matching on the first part and the sensitive words in the first sensitive word set to obtain a first matching result; and

carrying out sensitivity characteristic matching on the second part and the sensitive words in the second sensitive word set to obtain a second matching result; and

carrying out sensitivity scoring on the current webpage according to the matching result to obtain a scoring result comprises:

carrying out sensitivity scoring on the current webpage according to the first matching result, the second matching result, the first weight value and the second weight value to obtain the scoring result.

5. The method of claim 3 or 4, further comprising:

determining sensitive content matched with the sensitive words in the current webpage according to the matching result;

acquiring the position of the sensitive content in the current webpage; and

sending the scoring result of the current webpage and the position of the sensitive content in the current webpage to a caller.

6. The method of claim 1, wherein determining web page assets from the network assets comprises:

identifying a protocol type for each type of asset in the network assets; and

determining an asset whose protocol type belongs to the hypertext transfer protocol type as a web page asset.

7. An apparatus for identifying sensitive web pages, comprising:

an acquisition module for acquiring network assets in a target network space, wherein the network assets include multiple types of assets;

a determining module for determining web page assets from the network assets; and

an identification module for identifying the web pages in the web page assets so as to determine the sensitive web pages in the web page assets.

8. A computer system, comprising:

one or more processors;

a readable storage medium for storing one or more programs,

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-6.

9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 6.

10. A computer program product comprising executable instructions which, when executed by a processor, cause the processor to carry out the method of any one of claims 1 to 6.