CN115459946A

CN115459946A - Abnormal webpage identification method, device, equipment and computer storage medium

Info

Publication number: CN115459946A
Application number: CN202210920916.3A
Authority: CN
Inventors: 陈浩扬; 陈鑫; 徐雪芳; 刘坤锐
Original assignee: Guangzhou Xuanwu Wireless Technology Co Ltd
Current assignee: Guangzhou Xuanwu Wireless Technology Co Ltd
Priority date: 2022-08-02
Filing date: 2022-08-02
Publication date: 2022-12-09

Abstract

The invention relates to a method, an apparatus, a device and a computer storage medium for identifying an abnormal webpage. The method comprises the following steps: S1: acquiring an image to be identified of a webpage to be identified, wherein the image to be identified comprises a webpage image of the webpage to be identified and a sub-link image of a sub-link of the webpage to be identified; S2: acquiring text information in the image to be identified; S3: identifying the abnormal webpage by using the text information and a sensitive information base. Because sensitive information is identified in both the webpage to be identified and its sub-link webpages, and the abnormal webpage is determined from that identification result, sensitive information hidden in a sub-link webpage no longer goes undetected, which improves the accuracy of identifying abnormal webpages.

Description

Abnormal webpage identification method, device, equipment and computer storage medium

Technical Field

The present invention relates to the field of abnormal web page identification technologies, and in particular, to a method, an apparatus, a device, and a computer storage medium for identifying an abnormal web page.

Background

With the development of information technology, a great deal of violent and other sensitive information has appeared on the network. Such information does great harm to teenagers, so it is important to identify web pages containing sensitive information in a timely manner.

At present, web pages are mainly identified in one of two ways. The first builds a model feature library from the main features of images and performs image recognition, speech recognition, semantic recognition and character recognition on screenshots or audiovisual content of internet websites. The second builds a text feature library from keywords, matches the text content of the web page exactly or loosely, and then judges whether the web page contains sensitive information.

However, with the rapid development of information technology, a lot of sensitive information is hidden in the sub-links of a web page. Existing web page identification methods cannot identify sensitive information in the sub-links of a web page, which increases the difficulty and reduces the accuracy of identifying abnormal web pages.

Disclosure of Invention

Based on this, the present invention provides a method, an apparatus, a device and a computer storage medium for identifying an abnormal web page, which identify the abnormal web page according to whether the web page to be identified and its sub-link web pages contain sensitive information, so as to improve the identification accuracy of abnormal web pages.

A method for identifying abnormal web pages comprises the following steps:

S1: acquiring an image to be identified of a webpage to be identified, wherein the image to be identified comprises a webpage image of the webpage to be identified and a sub-link image of a sub-link of the webpage to be identified;

S2: acquiring text information in the image to be identified;

S3: identifying the abnormal webpage by using the text information and the sensitive information base.

According to the method for identifying the abnormal webpage, the sensitive information of the webpage to be identified and the sub-link webpage thereof is identified, and the abnormal webpage is identified according to the identification result of the sensitive information, so that the situation that the sensitive information is hidden in the sub-link webpage of the webpage to be identified and cannot be identified can be avoided, and the accuracy of identifying the abnormal webpage is improved.

Further, step S1 includes the steps of:

S11: acquiring the web address of the webpage to be identified and the web addresses of its sub-links, wherein the number of sub-link web addresses is at least one;

S12: calling a header function by using a Puppeteer headless browser to open the web address of the webpage to be identified and the at least one sub-link web address, to obtain the webpage to be identified and at least one sub-link webpage;

S13: taking scrolling screenshots of the webpage to be identified and the at least one sub-link webpage to obtain a webpage image and at least one sub-link image, and obtaining the image to be identified from the webpage image and the at least one sub-link image.

Further, in step S11, the sub-link addresses of the web pages to be identified are obtained through the following steps:

S111: calling a querySelectorAll function by using a Puppeteer headless browser to obtain the hyperlink tags of the webpage to be identified;

S112: calling a getAttribute function by using a Puppeteer headless browser to obtain the hyperlink attribute of each hyperlink tag;

S113: acquiring a sub-link web address of the webpage to be identified by using the hyperlink attribute, and calling a header function by using a Puppeteer headless browser to open the sub-link web address to obtain a sub-link webpage;

S114: repeating steps S111-S113 for the sub-link webpages until the convergence condition is met, then stopping the iteration to obtain at least one sub-link web address.
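The screening of hyperlink attributes in steps S112 and S113 can be illustrated with a short TypeScript sketch. It assumes the href values have already been read from the page's `<a>` tags, and it keeps only values beginning with http or https, which is the rule the detailed description gives for selecting sub-link web addresses. The helper name `filterSubLinks` and the removal of duplicates are illustrative assumptions, not part of the described method.

```typescript
// Keep only href values that begin with http:// or https://, dropping
// fragment identifiers ("#top"), JavaScript snippets ("javascript:...")
// and relative paths. filterSubLinks is a hypothetical helper name.
function filterSubLinks(hrefs: Array<string | null>): string[] {
  const kept: string[] = [];
  for (const href of hrefs) {
    if (href === null) continue; // <a> tag without an href attribute
    const value = href.trim();
    const lower = value.toLowerCase();
    const isHttp = lower.startsWith("http://") || lower.startsWith("https://");
    if (isHttp && kept.indexOf(value) === -1) {
      kept.push(value); // assumption: duplicate addresses are collapsed
    }
  }
  return kept;
}

const exampleHrefs = [
  "https://example.com/a",
  "#section-2",
  "javascript:void(0)",
  "/relative/path",
  null,
  "https://example.com/a",
  "http://example.com/b",
];
const subLinks = filterSubLinks(exampleHrefs);
// subLinks is ["https://example.com/a", "http://example.com/b"]
```

Relative paths are dropped here only because the text screens on the http/https prefix; an implementation that wanted to follow them would first have to resolve them against the page address.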

Further, the scrolling screenshots of the webpage to be identified and the at least one sub-link webpage are taken by calling a page.screenshot function by using a Puppeteer headless browser, with the fullPage parameter of the page.screenshot function set to true.

Further, step S2 comprises the following sub-steps:

S21: extracting initial text information from the image to be identified by using an OCR character recognition method;

S22: removing the special characters in the initial text information by using a special character library to obtain the text information.

Further, in step S22, removing the special characters in the initial text information by using the special character library means determining whether the initial text information contains any special character in the special character library and, if so, removing that special character from the initial text information.

Further, step S3 comprises the following sub-steps:

S31: acquiring a sensitive information base, wherein the sensitive information base is composed of a plurality of sensitive words or sensitive characters;

S32: judging whether the text information contains at least one sensitive word or sensitive character in the sensitive information base, and if so, determining that the webpage to be identified is an abnormal webpage.

The invention also provides an apparatus for identifying an abnormal web page, which comprises:

the image acquisition module is used for acquiring an image to be identified of a webpage to be identified, and the image to be identified comprises a webpage image of the webpage to be identified and a sub-link image of a sub-link of the webpage to be identified;

the text information acquisition module is used for acquiring text information in the image to be identified;

and the identification module is used for identifying the abnormal webpage by utilizing the text information and the sensitive information base.

The invention also provides an identification device of the abnormal web page, which comprises a memory, a processor and a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor to realize the identification method of the abnormal web page.

The present invention also provides a computer readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the method for identifying an abnormal web page according to the present invention.

For a better understanding and practice, the invention is described in detail below with reference to the accompanying drawings.

Drawings

Fig. 1 is a flowchart of an identification method of an abnormal web page according to the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.

It should be understood that the embodiments described are only a few embodiments of the present application, and not all embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without making any creative effort belong to the protection scope of the embodiments in the present application.

The terminology used in the embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the present application. As used in the examples of this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.

The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the application, as detailed in the appended claims. In the description of the present application, it is to be understood that the terms "first," "second," "third," and the like are used solely for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order, nor is it to be construed as indicating or implying relative importance. The specific meaning of the above terms in the present application can be understood by those of ordinary skill in the art as appropriate.

Further, in the description of the present application, "a plurality" means two or more unless otherwise specified. "And/or" describes the association relationship of the associated objects and indicates that three relationships are possible; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship.

It is to be understood that the embodiments of the present application are not limited to the precise arrangements described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the embodiments of the present application is limited only by the following claims.

Referring to fig. 1, the present embodiment provides a method for identifying an abnormal web page, including the following steps:

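S1: acquiring an image to be identified of a webpage to be identified, wherein the image to be identified comprises a webpage image of the webpage to be identified and a sub-link image of a sub-link of the webpage to be identified;

S2: acquiring text information in the image to be identified;

S3: identifying the abnormal webpage by using the text information and the sensitive information base.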

In the present invention, step S1 includes the steps of:

S11: Obtaining the web address of the webpage to be identified and at least one sub-link web address thereof. The web address of the webpage to be identified is known; for example, when the identification method is applied to an identification platform, the platform obtains the web address of the webpage to be identified by automatic scanning.

S12: Calling a header function by using a Puppeteer headless browser to open the web address of the webpage to be identified and the at least one sub-link web address, to obtain the webpage to be identified and at least one sub-link webpage.

S13: Taking scrolling screenshots of the webpage to be identified and the at least one sub-link webpage to obtain a webpage image and at least one sub-link image, and obtaining the image to be identified from the webpage image and the at least one sub-link image.
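Steps S12 and S13 can be sketched in TypeScript as follows. The sketch assumes that the page-opening call, which the text names a header function, corresponds to Puppeteer's `page.goto`, and it uses `page.screenshot` with `fullPage: true` for the scrolling screenshot. The interfaces `ShotPage` and `ShotBrowser` and the function `captureImages` are hypothetical names; they mirror only the few Puppeteer methods used so that the example does not depend on an installed package.

```typescript
// Structural stand-ins for the Puppeteer Page and Browser methods used here.
// Exact return types differ between Puppeteer releases; this is a sketch.
interface ShotPage {
  goto(url: string): Promise<unknown>;
  screenshot(options: { fullPage: boolean }): Promise<Uint8Array>;
  close(): Promise<void>;
}
interface ShotBrowser {
  newPage(): Promise<ShotPage>;
}

interface CapturedImage {
  url: string;
  image: Uint8Array; // encoded screenshot bytes
}

// Opens the page to be identified and each sub-link page in turn and takes
// a full-page (scrolling) screenshot of each one.
async function captureImages(
  browser: ShotBrowser,
  pageUrl: string,
  subLinkUrls: string[],
): Promise<CapturedImage[]> {
  const images: CapturedImage[] = [];
  for (const url of [pageUrl, ...subLinkUrls]) {
    const page = await browser.newPage();
    try {
      await page.goto(url); // S12: open the web address
      const image = await page.screenshot({ fullPage: true }); // S13
      images.push({ url, image });
    } finally {
      await page.close(); // release the tab even if a step fails
    }
  }
  return images;
}
```

With the real library, the object returned by `puppeteer.launch()` could be passed as `browser`. The returned list, holding the webpage image first and the sub-link images after it, plays the role of the image to be identified.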

The webpage to be identified and each sub-link webpage may themselves contain at least one sub-link. To prevent sensitive information from being hidden in any sub-link, a webpage image of the webpage to be identified and sub-link images of all sub-link webpages need to be acquired, which in turn requires first acquiring the web addresses of the webpage to be identified and of all sub-link webpages. The web addresses of the sub-link webpages are acquired through the following steps:

S111: Calling a querySelectorAll function by using a Puppeteer headless browser to obtain the hyperlink tags of the webpage to be identified. The webpage to be identified in this step includes both the initial webpage to be identified and the sub-link webpages. The querySelectorAll function is a tag query function; calling it through the Puppeteer headless browser yields the hyperlink tags of the webpage to be identified and of the sub-link webpages. A hyperlink tag is an HTML <a> tag, which links one page to another.

S112: Calling a getAttribute function by using a Puppeteer headless browser to obtain the hyperlink attribute of each hyperlink tag. The getAttribute function is an attribute acquisition function. The hyperlink attribute is the href attribute, which specifies the URL of the hyperlink target; in the present invention, its value may be the relative or absolute URL of any hyperlink, including fragment identifiers and JavaScript code snippets.

S113: Obtaining the sub-link web addresses of the webpage to be identified by using the hyperlink attribute, and calling a header function by using a Puppeteer headless browser to open each sub-link web address to obtain the sub-link webpage. Since the value of the hyperlink attribute contains the URL of the hyperlink target, the values beginning with http or https are selected from the hyperlink attribute values; these are the sub-link web addresses.

S114: Repeating steps S111-S113 for the sub-link webpages until the convergence condition is met, then stopping the iteration to obtain at least one sub-link web address. Because the webpage to be identified and each sub-link webpage may themselves contain sub-links, the steps need to be repeated for each sub-link webpage. The convergence condition is that no hyperlink tag can be obtained in step S111, which indicates that the sub-link webpage has no sub-links; the iteration then stops and the at least one sub-link web address is obtained.
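The iteration in steps S111-S114 can be sketched as a breadth-first traversal in TypeScript. Two things in the sketch go beyond the text and are assumptions: a set of already seen addresses, so that pages linking back to each other cannot keep the loop running, and a `maxPages` cap on the number of pages opened. The interface `CrawlPage`, its method `readHrefs` and the function `collectSubLinks` are hypothetical names; as above, `page.goto` is assumed to be the page-opening call.

```typescript
// Structural stand-in for the page object driven by the traversal.
interface CrawlPage {
  goto(url: string): Promise<unknown>;
  // Assumed to return the href attribute of every <a> tag on the current
  // page. With Puppeteer this could be implemented as
  //   page.$$eval("a", (tags) => tags.map((t) => t.getAttribute("href")))
  readHrefs(): Promise<Array<string | null>>;
}

// Collects the sub-link web addresses reachable from startUrl.
async function collectSubLinks(
  page: CrawlPage,
  startUrl: string,
  maxPages = 50, // assumption: safety cap, not part of the described method
): Promise<string[]> {
  const seen = new Set<string>([startUrl]); // assumption: skip revisits
  const queue: string[] = [startUrl];
  const subLinks: string[] = [];
  let opened = 0;
  while (queue.length > 0 && opened < maxPages) {
    const url = queue.shift() as string;
    await page.goto(url);
    opened += 1;
    const hrefs = await page.readHrefs(); // S111 and S112
    for (const href of hrefs) {
      if (href === null) continue;
      const isHttp = href.startsWith("http://") || href.startsWith("https://");
      if (!isHttp || seen.has(href)) continue; // S113: keep http(s) values
      seen.add(href);
      subLinks.push(href);
      queue.push(href); // S114: repeat the steps for this sub-link page
    }
  }
  return subLinks;
}
```

The loop ends when no opened page yields a new address, which covers the stated convergence condition (a page with no hyperlink tags adds nothing to the queue) and also terminates on sites whose pages link to one another.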

In this embodiment, step S2 includes the following sub-steps:

S21: Extracting initial text information from the image to be identified by using an OCR character recognition method. OCR is optical character recognition, a process that recognizes the character shapes in the image to be identified and translates them into computer characters; that is, the image to be identified is analyzed and processed to obtain its character and layout information.

S22: Removing the special characters in the initial text information by using a special character library to obtain the text information. When the OCR character recognition method is applied to the image to be identified, special characters such as '/' in the image are also recognized. Such characters are not sensitive information and can interfere with its identification, so they need to be removed from the initial text information to improve the accuracy of sensitive-information identification. To remove them, each character in the initial text information is matched against the special characters in the special character library, and every matching character is removed, which yields the text information of the image to be identified.
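The character-by-character matching described in S22 can be sketched in TypeScript as below. The OCR step itself is not shown; the sketch takes the initial text as a plain string. The contents of `SPECIAL_CHARACTERS` are an assumption for illustration, since the text names only '/' as an example, and `removeSpecialCharacters` is a hypothetical helper name.

```typescript
// Hypothetical special character library; only "/" is named in the text.
const SPECIAL_CHARACTERS: string[] = ["/", "\\", "|", "*", "#", "~", "^"];

// Matches each character of the OCR output against the library and drops
// every character that is found there.
function removeSpecialCharacters(initialText: string, library: string[]): string {
  let cleaned = "";
  for (const ch of initialText) {
    if (library.indexOf(ch) === -1) {
      cleaned += ch; // not a special character: keep it
    }
  }
  return cleaned;
}

const cleanedText = removeSpecialCharacters("vio/len*ce", SPECIAL_CHARACTERS);
// cleanedText is "violence"
```

The example also shows why the step matters for S3: a sensitive word split by stray separators would not match the sensitive information base until the separators are removed.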

In this embodiment, step S3 includes the following sub-steps:

S31: Obtaining a sensitive information base composed of a plurality of sensitive words or sensitive characters. In the present invention, the sensitive words or sensitive characters are words or characters related to violence and similar content.

S32: Judging whether the text information contains at least one sensitive word or sensitive character in the sensitive information base; when it does, the webpage to be identified is an abnormal webpage.
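The judgment in S31 and S32 amounts to a containment test, sketched here in TypeScript. The entries of the sensitive information base are placeholders, and `findSensitiveHits` and `isAbnormalPage` are hypothetical names. The text passed in is assumed to be the cleaned text gathered from the webpage image and all sub-link images together.

```typescript
// Returns every entry of the sensitive information base that occurs in the text.
function findSensitiveHits(text: string, sensitiveBase: string[]): string[] {
  const hits: string[] = [];
  for (const entry of sensitiveBase) {
    // Empty entries are skipped because they would match any text.
    if (entry.length > 0 && text.indexOf(entry) !== -1) {
      hits.push(entry);
    }
  }
  return hits;
}

// A page is abnormal as soon as at least one entry is found.
function isAbnormalPage(text: string, sensitiveBase: string[]): boolean {
  return findSensitiveHits(text, sensitiveBase).length > 0;
}

const placeholderBase = ["violence", "gore"]; // placeholder entries only
const flagged = isAbnormalPage("this page promotes violence", placeholderBase);
// flagged is true
```

Plain substring matching is the simplest reading of the step; for a large sensitive information base, a multi-pattern matcher could replace the loop without changing the result.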

Based on the identification method of the abnormal web page provided by the embodiment, the embodiment further provides an identification device of the abnormal web page, which includes:

The image acquisition module is used for acquiring an image to be identified of the webpage to be identified, the image to be identified comprising a webpage image of the webpage to be identified and a sub-link image of a sub-link of the webpage to be identified.

The text information acquisition module is used for acquiring the text information in the image to be identified.

The identification module is used for identifying the abnormal webpage by using the text information and the sensitive information base.

It should be noted that the modules in this embodiment may be specifically implemented by a computer chip or an entity, or implemented by a product with certain functions. For convenience of description, the above devices are described as being divided into various modules by functions, and are described separately. It is to be understood that, in implementing the present specification, functions of each module may be implemented in one or more pieces of software and/or hardware, or a module that implements the same function may be implemented by a combination of a plurality of sub-modules or sub-units, or the like. The above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and other divisions may be realized in practice, for example, a plurality of modules or components may be combined or integrated into another system, or some features may be omitted, or not executed.
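One possible software arrangement of the three modules is sketched below in TypeScript, purely as an illustration of the logical division described above. The interface names, method names and the class `AbnormalPageIdentifier` are all hypothetical; each module is passed in as an object so that it can be backed by any implementation, for example a headless browser for image acquisition and an OCR engine for text acquisition.

```typescript
// One interface per module of the apparatus.
interface ImageAcquisitionModule {
  acquire(pageUrl: string): Promise<Uint8Array[]>; // webpage and sub-link images
}
interface TextAcquisitionModule {
  extractText(images: Uint8Array[]): Promise<string>; // OCR plus cleaning
}
interface IdentificationModule {
  isAbnormal(text: string): boolean; // check against the sensitive information base
}

// Chains the modules in the order S1, S2, S3.
class AbnormalPageIdentifier {
  private readonly imageModule: ImageAcquisitionModule;
  private readonly textModule: TextAcquisitionModule;
  private readonly identificationModule: IdentificationModule;

  constructor(
    imageModule: ImageAcquisitionModule,
    textModule: TextAcquisitionModule,
    identificationModule: IdentificationModule,
  ) {
    this.imageModule = imageModule;
    this.textModule = textModule;
    this.identificationModule = identificationModule;
  }

  async identify(pageUrl: string): Promise<boolean> {
    const images = await this.imageModule.acquire(pageUrl); // S1
    const text = await this.textModule.extractText(images); // S2
    return this.identificationModule.isAbnormal(text); // S3
  }
}
```

Because the division is only logical, the same three responsibilities could equally be merged into one component or split into finer sub-modules.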

The invention also provides an identification device of the abnormal web page, which comprises a memory, a processor and a computer program, wherein the computer program is stored in the memory and is configured to be executed by the processor to realize the identification method of the abnormal web page provided by the embodiment.

The present invention also provides a computer readable storage medium, on which a computer program is stored, where the computer program is executed by a processor to implement the method for identifying an abnormal web page provided in this embodiment.

The above embodiments express only several implementations of the present invention, and although they are described specifically and in detail, they are not to be construed as limiting the scope of the invention. It should be noted that those skilled in the art may make changes and modifications without departing from the spirit of the present invention, and the present invention is intended to encompass such changes and modifications.

Claims

1. A method for identifying an abnormal webpage is characterized by comprising the following steps:

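S1: acquiring an image to be identified of a webpage to be identified, wherein the image to be identified comprises a webpage image of the webpage to be identified and a sub-link image of a sub-link of the webpage to be identified;

S2: acquiring text information in the image to be identified;

S3: identifying the abnormal webpage by using the text information and the sensitive information base.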

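2. The method for identifying an abnormal web page according to claim 1, wherein the step S1 comprises the steps of:

S11: acquiring the web address of the webpage to be identified and the web addresses of its sub-links, wherein the number of sub-link web addresses is at least one;

S12: calling a header function by using a Puppeteer headless browser to open the web address of the webpage to be identified and the at least one sub-link web address, to obtain the webpage to be identified and at least one sub-link webpage;

S13: taking scrolling screenshots of the webpage to be identified and the at least one sub-link webpage to obtain a webpage image and at least one sub-link image, and obtaining the image to be identified from the webpage image and the at least one sub-link image.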

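3. The method for identifying an abnormal web page according to claim 2, wherein in step S11, the sub-link web addresses of the web page to be identified are obtained by the following steps:

S111: calling a querySelectorAll function by using a Puppeteer headless browser to obtain the hyperlink tags of the webpage to be identified;

S112: calling a getAttribute function by using a Puppeteer headless browser to obtain the hyperlink attribute of each hyperlink tag;

S113: acquiring a sub-link web address of the webpage to be identified by using the hyperlink attribute, and calling a header function by using a Puppeteer headless browser to open the sub-link web address to obtain a sub-link webpage;

S114: repeating steps S111-S113 for the sub-link webpages until the convergence condition is met, then stopping the iteration to obtain at least one sub-link web address.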

4. The method for identifying an abnormal web page according to claim 2, wherein the scrolling screenshots of the web page to be identified and the at least one sub-link web page are taken by calling a page.screenshot function by using a Puppeteer headless browser, and the fullPage parameter of the page.screenshot function is true.

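5. The method for identifying an abnormal web page according to any one of claims 1 to 4, wherein the step S2 comprises the following sub-steps:

S21: extracting initial text information from the image to be identified by using an OCR character recognition method;

S22: removing the special characters in the initial text information by using a special character library to obtain the text information.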

6. The method for identifying an abnormal web page according to claim 5, wherein in step S22, removing the special characters in the initial text information by using the special character library means determining whether the initial text information contains any special character in the special character library and, if so, removing that special character from the initial text information.

7. The method for identifying an abnormal web page according to claim 5, wherein the step S3 comprises the sub-steps of:

S31: acquiring a sensitive information base, wherein the sensitive information base is composed of a plurality of sensitive words or sensitive characters;

S32: judging whether the text information contains at least one sensitive word or sensitive character in the sensitive information base, wherein when the text information contains at least one sensitive word or sensitive character, the webpage to be identified is an abnormal webpage.

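8. An apparatus for identifying an abnormal web page, comprising:

an image acquisition module, used for acquiring an image to be identified of a webpage to be identified, the image to be identified comprising a webpage image of the webpage to be identified and a sub-link image of a sub-link of the webpage to be identified;

a text information acquisition module, used for acquiring text information in the image to be identified;

an identification module, used for identifying the abnormal webpage by using the text information and the sensitive information base.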

9. An apparatus for identifying an abnormal web page, comprising a memory, a processor, and a computer program stored in the memory and configured to be executed by the processor to implement the method for identifying an abnormal web page according to any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the method of identifying an abnormal web page according to any one of claims 1 to 7.