CN112036412A - Webpage identification method, device, equipment and storage medium - Google Patents

Publication number
CN112036412A
Authority
CN
China
Prior art keywords
webpage
pixel values
screenshot
value
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010882588.3A
Other languages
Chinese (zh)
Inventor
张龙
何恐
张晓峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Original Assignee
Nsfocus Technologies Inc
Nsfocus Technologies Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nsfocus Technologies Inc, Nsfocus Technologies Group Co Ltd
Priority to CN202010882588.3A
Publication of CN112036412A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a webpage identification method, a webpage identification device and a webpage identification storage medium, relates to the technical field of computers, and is used for realizing automatic detection of webpage types and reducing complexity of webpage identification. The method comprises the following steps: acquiring a webpage screenshot of the webpage to be identified according to the uniform resource locator URL of the webpage to be identified; determining an image segmentation line of the webpage screenshot according to the difference between pixel values of each row or each column of the webpage screenshot, and performing image segmentation on the webpage screenshot according to the image segmentation line to obtain at least one target detection picture; determining the probability that each target detection picture in at least one target detection picture belongs to a target picture type, and determining the probability that the webpage to be identified belongs to a target webpage type according to the probability of each target picture; and when the probability that the webpage to be identified belongs to the target webpage type is greater than the set probability threshold value, determining the type of the webpage to be identified as the target webpage type.

Description

Webpage identification method, device, equipment and storage medium
Technical Field
The application relates to the technical field of computers, and provides a webpage identification method, a webpage identification device, webpage identification equipment and a webpage identification storage medium.
Background
When a user browses the web, some web pages in the network may contain bad information that is harmful to the physical and mental health of certain users; access to such bad web pages therefore needs to be restricted.
Currently, the most common way of handling such bad web pages is the blacklist method: known web page addresses or domain names containing bad information are manually added to a blacklist address library, and the web pages accessed by a user are compared against the blacklisted addresses and keywords so as to restrict the web page addresses and related information the user can access.
However, the above-mentioned blacklist method cannot limit undiscovered and newly added bad web pages.
Disclosure of Invention
The embodiment of the application provides a webpage identification method, a webpage identification device and a webpage identification storage medium, which are used for realizing automatic detection of webpage types and reducing the complexity of webpage identification.
In one aspect, a method for identifying a web page is provided, and the method includes:
acquiring a webpage screenshot of the webpage to be identified according to the uniform resource locator URL of the webpage to be identified;
determining an image segmentation line of the webpage screenshot according to the difference between pixel values of each row or each column of the webpage screenshot, and performing image segmentation on the webpage screenshot according to the image segmentation line to obtain at least one target detection picture;
determining the probability that each target detection picture in at least one target detection picture belongs to a target picture type, and determining the probability that the webpage to be identified belongs to a target webpage type according to the probability of each target picture; and when the probability that the webpage to be identified belongs to the target webpage type is greater than the set probability threshold value, determining the type of the webpage to be identified as the target webpage type.
Optionally, determining the probability that the webpage to be identified belongs to the target webpage type according to the probability of each target picture includes:
and determining the probability that the webpage to be identified belongs to the type of the target webpage according to the probability of each target picture and the weight value set for each target picture.
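The weighted combination described above can be sketched as follows. This is a minimal illustration, assuming the combination rule is a weighted average of the per-picture probabilities; the text does not fix the exact formula, and the function names `page_probability` and `is_target_page` as well as the default threshold are hypothetical.

```python
from typing import Sequence


def page_probability(picture_probs: Sequence[float], weights: Sequence[float]) -> float:
    """Combine per-picture probabilities into one page-level probability.

    Assumption: a weighted average is used; the source only says that a
    weight value is set for each target picture.
    """
    if not picture_probs or len(picture_probs) != len(weights):
        raise ValueError("need one weight per picture probability")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(p * w for p, w in zip(picture_probs, weights))
    return weighted_sum / total_weight


def is_target_page(picture_probs: Sequence[float],
                   weights: Sequence[float],
                   threshold: float = 0.5) -> bool:
    """The page is the target type when its probability exceeds the threshold."""
    return page_probability(picture_probs, weights) > threshold
```

For example, a large picture with probability 0.9 and weight 3 combined with a small picture with probability 0.1 and weight 1 gives a page-level probability of about 0.7 under this assumed rule.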
In one aspect, an apparatus for web page identification is provided, the apparatus comprising:
the acquisition unit is used for acquiring a webpage screenshot of the webpage to be identified according to the uniform resource locator URL of the webpage to be identified;
the determining unit is used for determining an image dividing line of the webpage screenshot according to the difference degree between pixel values of each row or each column of the webpage screenshot;
the image segmentation unit is used for carrying out image segmentation on the webpage screenshot according to the image segmentation line to obtain at least one target detection picture;
the determining unit is further configured to determine a probability that each target detection picture in the at least one target detection picture belongs to a target picture type, and determine a probability that the webpage to be identified belongs to the target webpage type according to the probability of each target picture; and when the probability that the webpage to be identified belongs to the target webpage type is greater than the set probability threshold value, determining the type of the webpage to be identified as the target webpage type.
Optionally, the obtaining unit is configured to:
sending a URL access request to a webpage server corresponding to the URL according to the URL;
receiving a hypertext markup language (HTML) document corresponding to the URL returned by the webpage server;
and analyzing the HTML document, rendering according to the content obtained by analyzing the HTML document, and acquiring the webpage screenshot.
Optionally,
the determining unit is used for determining an image dividing line of the webpage screenshot in a first direction according to the difference degree between pixel values in each group of pixel values in the first direction in the webpage screenshot, wherein the first direction is a row direction or a column direction;
the image segmentation unit is used for carrying out image segmentation on the webpage screenshot based on the image segmentation lines in the first direction to obtain a plurality of segmentation pictures;
the determining unit is used for selecting a target segmentation picture from the segmentation pictures and determining an image segmentation line of the target segmentation picture in a second direction, wherein the second direction is a column direction when the first direction is a row direction, or the second direction is a row direction when the first direction is a column direction;
the image segmentation unit is used for carrying out image segmentation on the target segmentation picture based on the image segmentation line in the second direction to obtain a plurality of segmentation pictures;
the determining unit is configured to determine, as the at least one target detection picture, a plurality of divided pictures divided based on the image dividing line in the first direction and a plurality of divided pictures divided based on the image dividing line in the second direction.
Optionally, the determining unit is configured to:
determining the difference degree between each pixel value in each group of pixel values in the first direction in the webpage screenshot; wherein the first direction is a row direction and each group of pixel values is a row of pixel values, or the first direction is a column direction and each group of pixel values is a column of pixel values;
and when the difference degree of each group of pixel values is smaller than a set difference degree threshold value, determining pixel points corresponding to each group of pixel values as image dividing lines of the webpage screenshot.
Optionally, the determining unit is configured to:
determining the difference degree between each pixel value in each group of pixel values in the first direction in the webpage screenshot;
obtaining a representation sequence of the webpage screenshot in the first direction according to the difference degree of each group of pixel values; wherein, a group of pixel values corresponds to a position in the representation sequence, adjacent groups are adjacent in position in the representation sequence, when the difference degree of each group of pixel values is smaller than a set difference degree threshold value, the sequence value of the position corresponding to each group of pixel values is a first value, when the difference degree of each group of pixel values is larger than or equal to the difference degree threshold value, the sequence value of the position corresponding to each group of pixel values is a second value, and the first value is different from the second value;
and determining an image segmentation line of the webpage screenshot in the first direction according to the representation sequence.
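A minimal sketch of building the representation sequence and reading intervals from it is given below. The encodings `FIRST = 0` and `SECOND = 1`, the function names, and the reading that runs of the first value are dividing bands while runs of the second value are content are all assumptions made for illustration.

```python
from itertools import groupby
from typing import List, Sequence, Tuple

FIRST, SECOND = 0, 1  # assumed encodings of the first and second values


def representation_sequence(differences: Sequence[float], threshold: float) -> List[int]:
    """One sequence value per group: FIRST if its difference degree is below the threshold."""
    return [FIRST if d < threshold else SECOND for d in differences]


def runs(sequence: Sequence[int]) -> List[Tuple[int, int, int]]:
    """Return (value, start, end_exclusive) for each run of equal consecutive values."""
    result, pos = [], 0
    for value, group in groupby(sequence):
        length = len(list(group))
        result.append((value, pos, pos + length))
        pos += length
    return result


def dividing_intervals(sequence: Sequence[int]) -> List[Tuple[int, int]]:
    """Runs of FIRST values, read here as candidate dividing bands."""
    return [(s, e) for v, s, e in runs(sequence) if v == FIRST]


def content_intervals(sequence: Sequence[int]) -> List[Tuple[int, int]]:
    """Runs of SECOND values, read here as the regions between dividing bands."""
    return [(s, e) for v, s, e in runs(sequence) if v == SECOND]
```

Adjacent groups keep adjacent positions in the sequence, so each interval maps directly back to a band of rows or columns in the screenshot.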
Optionally, the determining unit is configured to:
for each group of pixel values, taking one of the pixel values in each group of pixel values as a reference pixel value, and acquiring the difference degree between the rest pixel values in each group of pixel values and the reference pixel value;
and acquiring the difference degree of each group of pixel values based on the difference degree between the rest pixel values in each group of pixel values and the reference pixel value.
Optionally, the determining unit is configured to:
acquiring a second number, the second number being the number of intervals in the representation sequence whose sequence values are consecutively the second value;
determining whether the second number is less than or equal to a set number threshold;
if the second number is determined to be greater than the set number threshold, executing the following loop process until the second number is less than or equal to the set number threshold, wherein each loop process comprises the following steps:
for each interval whose sequence values are consecutively the second value, if the length of the interval is less than or equal to a length threshold, setting the sequence values of the interval to the first value to obtain a first updated sequence;
for each interval in the first updated sequence whose sequence values are consecutively the first value, if the length of the interval is less than or equal to the length threshold, setting the sequence values of the interval to the second value to obtain a second updated sequence;
determining whether the second number in the second updated sequence is less than or equal to the set number threshold;
if the second number in the second updated sequence is determined to be greater than the set number threshold, entering the next loop process; or,
if the second number in the second updated sequence is determined to be less than or equal to the set number threshold, ending the loop.
Optionally, the length threshold is set according to the number of cycles.
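The loop process above can be sketched as follows. This is an illustration under stated assumptions: the first and second values are encoded as 0 and 1, the length threshold is taken to be equal to the cycle index (the text only says it is set according to the number of cycles), and the name `merge_intervals` and the helper names are hypothetical.

```python
from itertools import groupby
from typing import Iterator, List, Sequence, Tuple

FIRST, SECOND = 0, 1  # assumed encodings of the first and second values


def _runs(seq: Sequence[int]) -> Iterator[Tuple[int, int, int]]:
    """Yield (value, start, end_exclusive) for each run of equal consecutive values."""
    pos = 0
    for value, group in groupby(seq):
        length = len(list(group))
        yield value, pos, pos + length
        pos += length


def count_second_runs(seq: Sequence[int]) -> int:
    """The 'second number': how many intervals are consecutively the second value."""
    return sum(1 for value, _, _ in _runs(seq) if value == SECOND)


def _flip_short_runs(seq: Sequence[int], value: int, max_len: int) -> List[int]:
    """Flip every run of `value` whose length is at most max_len to the other value."""
    other = FIRST if value == SECOND else SECOND
    out = list(seq)
    for v, start, end in _runs(seq):
        if v == value and end - start <= max_len:
            out[start:end] = [other] * (end - start)
    return out


def merge_intervals(seq: Sequence[int], max_count: int) -> List[int]:
    """Repeat the two-step update until at most max_count second-value intervals remain."""
    if max_count < 1:
        raise ValueError("max_count must be at least 1 for the loop to terminate")
    current = list(seq)
    cycle = 0
    while count_second_runs(current) > max_count:
        cycle += 1
        length_threshold = cycle  # assumption: threshold grows with the cycle count
        first_update = _flip_short_runs(current, SECOND, length_threshold)
        current = _flip_short_runs(first_update, FIRST, length_threshold)
    return current
```

With a growing length threshold, short content runs are absorbed into blank bands first and short blank gaps are then closed, so the number of content intervals shrinks toward the set number threshold.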
Optionally, the determining unit is configured to:
and determining the probability that the webpage to be identified belongs to the type of the target webpage according to the probability of each target picture and the weight value set for each target picture.
In one aspect, a computer device is provided, comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the above methods when executing the computer program.
In one aspect, a computer storage medium is provided having computer program instructions stored thereon that, when executed by a processor, implement the steps of any of the above-described methods.
In one aspect, a computer program product or computer program is provided that includes computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the steps of any of the methods described above.
In the embodiment of the application, a web page screenshot is obtained according to a URL (uniform resource locator); an image segmentation line of the web page screenshot is determined according to the degree of difference between pixel values of each row or each column of the web page screenshot; the web page screenshot is segmented according to the image segmentation line to obtain at least one target detection picture; the probability that each target detection picture belongs to the target picture type is determined; the probability that the web page to be identified belongs to the target web page type is then determined; and finally whether the type of the web page to be identified is the target web page type is determined. In this way, the web page screenshot is generated automatically and divided into multiple pictures to be detected, and the type of the web page is determined by integrating the detection results of these pictures. Undiscovered or newly added bad web pages can therefore still be identified and detected and access to them restricted, which improves the safety of network access and reduces the complexity of web page identification.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments or related technologies, the drawings needed to be used in the description of the embodiments or related technologies are briefly introduced below, it is obvious that the drawings in the following description are only the embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic diagram of an application scenario of a web page identification device according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of a web page identification method according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a web page screenshot provided in an embodiment of the present application;
Fig. 4 is a schematic diagram of a process for obtaining a representation sequence according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a representation sequence transformation process provided by an embodiment of the present application;
Fig. 6 is a schematic diagram of obtaining target picture probabilities from a web page screenshot according to an embodiment of the present application;
Fig. 7 is a schematic diagram of a web page identification apparatus according to an embodiment of the present application;
Fig. 8 is a schematic diagram of a web page identification device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be described clearly and completely with reference to the accompanying drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. In the present application, the embodiments and features of the embodiments may be arbitrarily combined with each other without conflict. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
At present, the commonly used blacklist method cannot identify the large number of undiscovered and newly added bad web pages. Whether a web page is a bad web page can also be determined by acquiring all pictures in the web page and identifying and detecting each picture, but this method needs to detect a large number of picture resources in the web page, so its resource consumption is high.
Based on this, in the method provided by the embodiment of the application, a web page screenshot is obtained according to a URL; an image segmentation line of the web page screenshot is determined according to the degree of difference between pixel values of each row or each column of the web page screenshot; the web page screenshot is segmented according to the image segmentation line to obtain at least one target detection picture; the probability that each target detection picture belongs to a target picture type is determined; and the probability that the web page to be identified belongs to the target web page type is then determined, so as to decide whether the type of the web page to be identified is the target web page type. In this way, the web page screenshot is generated automatically and divided into multiple pictures to be detected, and the type of the web page is determined by integrating the detection results of these pictures, so that undiscovered or newly added bad web pages can still be identified and detected and access to them restricted.
After introducing the design concept of the embodiment of the present application, some simple descriptions are provided below for application scenarios to which the technical solution of the embodiment of the present application can be applied, and it should be noted that the application scenarios described below are only used for describing the embodiment of the present application and are not limited. In a specific implementation process, the technical scheme provided by the embodiment of the application can be flexibly applied according to actual needs.
Fig. 1 is a schematic view of an application scenario of a web page identification device according to an embodiment of the present application. The application scenario of web page identification may include the web page identification device 10.
The web page identification device 10 is a computer device having a certain processing capability, and may be a Personal Computer (PC), a notebook computer, a server, or the like, for example. The server may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a network service, cloud communication, middleware service, a domain name service, a security service, a CDN, and a big data and artificial intelligence platform, but is not limited thereto.
The web page identification device 10 includes one or more processors 101, a memory 102, an I/O interface 103 for interacting with other devices, and the like. In addition, the web page identification device 10 may further be provided with a database 104, which may be used to store data involved in the solution provided in the embodiment of the present application, such as web page identification data, acquired web page addresses, and the identification result of each web page. The memory 102 of the web page identification device 10 may store program instructions of the web page identification method provided by the embodiment of the present application; when executed by the processor 101, these program instructions implement the steps of the web page identification method to determine whether a web page belongs to the target web page type.
In actual use, some web pages in the network may contain bad information that is harmful to the physical and mental health of certain users, and access to these bad web pages can be restricted by means of the web page identification device 10.
In the embodiment of the application, the web page to be identified may be an input web page, a web page in a web page list prepared in advance, or a web page obtained from network traffic data. The web page list may consist of web pages corresponding to web page addresses captured from the network; a web page obtained from network traffic data may be, for example, the web page corresponding to the web page address in a web page access request packet. The web page identification device 10 determines whether the web page belongs to the target web page type based on the web page identification method provided by the embodiment of the application, and adds web pages belonging to the target web page type to a restriction blacklist to restrict access to them.
Of course, the method provided in the embodiment of the present application is not limited to be used in the application scenario shown in fig. 1, and may also be used in other possible application scenarios, and the embodiment of the present application is not limited. The functions that can be implemented by each device in the application scenario shown in fig. 1 will be described in the following method embodiments, and will not be described in detail herein. Hereinafter, the method of the embodiment of the present application will be described with reference to the drawings.
Fig. 2 is a schematic flowchart of a web page identification method according to an embodiment of the present application, which can be executed by the web page identification apparatus 10 in fig. 1, and the flow of the method is described as follows.
Step 201: and acquiring the webpage screenshot of the webpage to be identified according to the URL of the webpage to be identified.
In this embodiment of the application, the web page identification device may obtain the URL of the web page to be identified from a web page input by a user, from a web page in a pre-prepared web page list, or from a web page obtained from network traffic data. The web page list may consist of web pages corresponding to web page addresses captured from the network; a web page obtained from network traffic data may be, for example, the web page corresponding to the web page address in a web page access request packet.
The webpage identification device can obtain the webpage screenshot of the webpage to be identified by simulating the process of the browser accessing the URL through the URL. Specifically, the URL access request may be sent to a web server corresponding to the URL according to the URL, a hypertext Markup Language (HTML) document returned by the web server is received, the HTML document is analyzed, a web page structure and contents to be displayed at each position in the web page are obtained based on the HTML document, and a web page screenshot of the web page to be identified is generated by rendering according to the web page structure and the contents to be displayed.
In practical use, besides obtaining the URL from a web page input by a user, the web page identification device can also crawl URLs using the urllib module of the Python standard library. The web page identification device accesses the web page to be identified based on the URL through a web page access module built on a browser engine, obtains the HTML document of the web page to be identified, loads the HTML document into memory, and executes the scripts included in the HTML document. Because no visual interface needs to be provided to the user, these scripts can be executed in the background without displaying a graphical interface of the web page. Specifically, the browser engine may be, for example, WebKit, and the web page access module may be, for example, PhantomJS, a headless browser built on WebKit.
After the HTML document has been executed in the background, a screenshot instruction of a screenshot tool can be called to obtain the web page screenshot. For example, the screenshot instruction of Selenium WebDriver can be called to obtain the web page screenshot.
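The screenshot step can be sketched with Selenium as below. Note that this sketch swaps in headless Chrome in place of PhantomJS, and assumes the third-party `selenium` package and a local Chrome installation are available; the function names `normalize_url` and `capture_screenshot` and the default window size are hypothetical.

```python
from urllib.parse import urlparse


def normalize_url(url: str) -> str:
    """Prepend http:// when no scheme separator is present; reject explicit
    non-http(s) schemes such as ftp://."""
    candidate = url.strip()
    if "://" not in candidate:
        candidate = "http://" + candidate
    parsed = urlparse(candidate)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not a web URL: {url!r}")
    return candidate


def capture_screenshot(url: str, out_path: str,
                       width: int = 1280, height: int = 2000) -> bool:
    """Render the page in a headless browser and save a PNG screenshot."""
    # Imported lazily: Selenium is a third-party package, not part of the stdlib.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")  # no graphical interface is displayed
    driver = webdriver.Chrome(options=options)
    try:
        driver.set_window_size(width, height)
        driver.get(normalize_url(url))  # loads the HTML and runs its scripts
        return driver.save_screenshot(out_path)
    finally:
        driver.quit()
```

A call such as `capture_screenshot("example.com", "page.png")` would write the rendered page to `page.png`; only the visible window area is captured, so the assumed window height bounds how much of a long page ends up in the screenshot.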
In the embodiment of the application, besides screenshot, a form of generating a Portable Document Format (PDF) Document from webpage content may be adopted, and text recognition may be subsequently performed on the PDF Document.
Step 202: determining an image segmentation line of the webpage screenshot according to the difference degree between pixel values of each row or each column of the webpage screenshot, and carrying out image segmentation on the webpage screenshot according to the image segmentation line to obtain at least one target detection picture.
In the embodiment of the application, before performing image segmentation on the web page screenshot of the web page to be identified, the web page identification device can scale the web page screenshot to a fixed size, which facilitates the subsequent image processing of the web page screenshot.
Since a large number of bad pictures or videos may be included in an actual bad webpage, and blank areas are often formed between the pictures and the videos, when the webpage is divided, the blank areas can be used as dividing lines to divide the screenshot of the webpage. As shown in fig. 3, a schematic diagram of a web page screenshot provided in an embodiment of the present application is provided, where the web page screenshot includes information such as pictures and texts, and blank areas exist between different pictures or between a picture and a text block, so that the web page screenshot can be divided by the blank areas.
Specifically, in the web page screenshot, the pixel values within a blank area are basically the same or differ only slightly, so the image segmentation line of the web page screenshot can be determined according to the degree of difference between the pixel values of each row or each column of the web page screenshot. In the 2-dimensional web page screenshot, pixels are distributed along a row direction and a column direction, and segmentation lines can be determined in both directions, so the screenshot can be segmented in one direction first and then in the other direction.
Specifically, the image segmentation line of the web page screenshot in the first direction can be determined according to the degree of difference between the pixel values in each group of pixel values in the first direction, and the web page screenshot is segmented based on the image segmentation lines in the first direction to obtain a plurality of segmented pictures. A target segmented picture is then selected from these segmented pictures, an image segmentation line of the target segmented picture in a second direction is determined, and the target segmented picture is segmented based on the image segmentation line in the second direction to obtain a further plurality of segmented pictures. The segmented pictures obtained in the first direction together with those obtained in the second direction are determined as the at least one target detection picture.
In the practical application process, the selection of the target segmented picture from the segmented pictures may be a random selection or a designated selection, for example, after 3 segmented pictures are obtained by segmentation, the middle segmented picture may be used as the target segmented picture.
The first direction is a row direction or a column direction, and the second direction is a direction different from the first direction. The second direction is a column direction when the first direction is a row direction, or the second direction is a row direction when the first direction is a column direction.
Since the segmentation process in the first direction and the second direction are similar, the process of image segmentation will be described below by taking the first direction as an example.
In actual implementation, the image segmentation line in the first direction may be determined by determining a degree of difference between pixel values in each group of pixel values in the first direction in the screenshot. Specifically, when determining the degree of difference, for each group of pixel values in the first direction, one of the pixel values in the group of pixel values may be used as a reference pixel value to obtain the degree of difference between the remaining pixel values in each group of pixel values and the reference pixel value, and further, the degree of difference between each group of pixel values may be obtained based on the degree of difference between the remaining pixel values in each group of pixel values and the reference pixel value.
When the first direction is a row direction, each group of pixel values may be a row of pixel values, or when the first direction is a column direction, each group of pixel values may be a column of pixel values.
Specifically, the reference pixel value of a group of pixel values may be, for example, the first pixel value in the group, that is, the pixel in the first column of a row or in the first row of a column. Other pixel values may of course also be used, which is not limited in this embodiment of the present application.
After the differences between the remaining pixel values in a group and the reference pixel value are obtained, the sum of these differences may be used as the degree of difference of the group of pixel values, or the average of these differences may be calculated and used as the degree of difference of the group.
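The computation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes single-channel integer pixel values stored as nested lists, takes the first pixel of each group as the reference, and the name `group_difference` is hypothetical.

```python
def group_difference(values, use_mean=False):
    """Degree of difference of one row (or column) of pixel values."""
    reference = values[0]  # assumption: the first pixel is the reference
    diffs = [abs(v - reference) for v in values[1:]]
    if not diffs:
        return 0
    total = sum(diffs)
    # Either the sum or the average of the differences may be used.
    return total / len(diffs) if use_mean else total


# Toy screenshot: one blank row and one row with content.
screenshot = [
    [255, 255, 255, 255],
    [255, 10, 200, 40],
]
row_differences = [group_difference(row) for row in screenshot]
```

A blank row yields a difference of zero, while a row crossing picture content yields a large value.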
In the embodiment of the present application, the manner of determining the image segmentation line based on the difference degree may include the following two manners.
First mode
In the embodiment of the application, after the difference degree of each group of pixel values is obtained, whether the difference degree of each group of pixel values is smaller than a set difference degree threshold value or not can be determined, when the difference degree of the group of pixel values is smaller than the set difference degree threshold value, pixel points corresponding to the group of pixel values are determined to be image partition lines of a webpage screenshot, and otherwise, the pixel points corresponding to the group of pixel values are not the image partition lines of the webpage screenshot.
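As a rough sketch of the first mode, the groups whose degree of difference falls below the threshold can be collected directly. The function name `dividing_lines` and the sample values are illustrative assumptions only.

```python
def dividing_lines(differences, threshold):
    """Indices of rows (or columns) whose difference is below the threshold."""
    return [i for i, d in enumerate(differences) if d < threshold]


# Hypothetical per-row degrees of difference for a five-row screenshot.
row_differences = [0, 515, 3, 0, 420]
lines = dividing_lines(row_differences, threshold=10)
```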
Second mode
In the embodiment of the application, a representation sequence of the webpage screenshot in the first direction is obtained according to the degree of difference of each group of pixel values. A group of pixel values corresponds to a position in the representation sequence, and adjacent groups are adjacent in position in the representation sequence. When the degree of difference of a group of pixel values is less than the set difference degree threshold, the sequence value of the position corresponding to the group is the first value B0; when the degree of difference of the group is greater than or equal to the difference degree threshold, the sequence value of the position corresponding to the group is the second value B1. The first value B0 and the second value B1 are different values: for example, the first value B0 is 0 and the second value B1 is 1, or the first value B0 is 1 and the second value B1 is 0.
Specifically, taking the first direction as the row direction as an example, the representation sequence can be obtained by binarizing the screenshot according to the following rules:
$$S_v(i) = \begin{cases} B_0, & \mathrm{All}\left(\left|P_{(i,w)} - P_i\right|\right) < N \\ B_1, & \text{otherwise} \end{cases}$$
where N is the difference degree threshold, P(i,w) is the pixel value in the ith row and wth column, Pi is the reference pixel value of the ith row, and All(|P(i,w) - Pi|) denotes the degree of difference of the ith row. When the degree of difference of the pixel values of the ith row is smaller than the set difference degree threshold N, the value at the position of the ith row in the representation sequence is B0; otherwise, that is, when the degree of difference is greater than or equal to the set difference degree threshold N, the value at the position of the ith row in the representation sequence is B1.
Fig. 4 is a schematic diagram of the process of obtaining a representation sequence, again taking rows as an example. The left diagram in fig. 4 is the pixel array of a webpage screenshot, and the right diagram is the representation sequence corresponding to the webpage screenshot. Through the above binarization process, the pixel array of a webpage screenshot can be converted into a representation sequence. Each cell of the pixel array represents a pixel point with a corresponding pixel value, and each cell of the representation sequence represents the binarized value of the corresponding row of pixel values in the pixel array.
In the embodiment of the application, after the representation sequence is obtained, an image segmentation line of the webpage screenshot in the first direction may be determined according to the representation sequence. As shown in fig. 4, the pixel groups corresponding to the positions whose value is B0 in the representation sequence may be determined as image segmentation lines.
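The binarization rule of the second mode can be illustrated with a short sketch. It assumes B0 = 0 and B1 = 1, single-channel integer pixels, the first pixel of each row as the reference, and the sum of absolute differences as the degree of difference; `representation_sequence` is a hypothetical name.

```python
B0, B1 = 0, 1  # assumption: first value 0, second value 1


def representation_sequence(image, threshold):
    """Binarize each row: B0 if the row is nearly uniform, else B1."""
    sequence = []
    for row in image:
        reference = row[0]
        difference = sum(abs(p - reference) for p in row[1:])
        sequence.append(B0 if difference < threshold else B1)
    return sequence


image = [
    [255, 255, 255],  # blank
    [255, 0, 128],    # content
    [10, 200, 30],    # content
    [255, 255, 254],  # nearly blank
]
sequence = representation_sequence(image, threshold=10)
# Rows whose sequence value is B0 are candidate dividing lines.
line_rows = [i for i, value in enumerate(sequence) if value == B0]
```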
In the embodiment of the application, noise possibly exists in the screenshot of the webpage, so that the image segmentation line is determined inaccurately. For example, the pixel value of a certain line of the screenshot is changed by noise, so that the difference of the pixel values of the line which should be small becomes very large, and the line cannot be an image segmentation line.
Therefore, in the embodiment of the application, after the representation sequence is obtained, the representation sequence can be denoised, the accuracy of determining the image segmentation line is improved, and the accuracy of subsequent image identification is further improved.
Specifically, the webpage identification device may identify, in the representation sequence Sv, the intervals R0 whose sequence values are continuously the first value B0 and the intervals R1 whose sequence values are continuously the second value B1, and obtain the second number C1 of intervals R1. Each interval R1 corresponds to one target segmented picture, and the number of target segmented pictures finally obtained may affect the accuracy of subsequently determining whether the webpage to be identified belongs to the target webpage type. A number threshold C can therefore be set for the target segmented pictures, and the webpage identification device judges the magnitude relation between the second number C1 and the number threshold C.
When the second number C1 is greater than the set number threshold C, the following loop is executed until the second number C1 is less than or equal to the set number threshold C, which limits the number of target segmented pictures and reduces erroneous segmentation caused by noise. Each cycle may include the following steps:
First, for each interval whose sequence values are continuously the second value B1, when the length of the interval is less than or equal to the length threshold T, the sequence values of the interval are set to the first value B0, obtaining a first update sequence. Here, an interval is a run of consecutive positions in the representation sequence whose values are all the second value B1 or all the first value B0, and the length of an interval is the number of positions it contains. As shown in fig. 4, the single position with value "1" at the top is an interval, the adjacent position with value "0" is also an interval, and the 9 positions with value 1 at the bottom also form an interval.
Second, for each interval in the updated representation sequence whose sequence values are continuously the first value B0, if the length of the interval is less than or equal to the length threshold T, the sequence values of the interval are set to the second value B1, obtaining a second update sequence.
Finally, whether the second number in the second updating sequence is smaller than or equal to a set number threshold value C is judged. When the second number in the second updating sequence is larger than the set number threshold value C, the number of the target segmentation pictures is more, and therefore the next cycle process is started; otherwise, when the second number in the second update sequence is less than or equal to the set number threshold C, the loop ends.
In this embodiment of the present application, the number threshold C may be set to 3, for example, but may also be set to other possible values, which is not limited in this embodiment of the present application. The length threshold T may be set according to the number of cycles. For example, the initial length threshold T may be zero, and the length threshold T increases as the number of cycles increases; for example, the length threshold T in each cycle may be increased by one on the basis of the number of cycles.
Fig. 5 is a schematic diagram of the representation sequence transformation process provided in this embodiment of the present application, in which a new representation sequence is obtained by updating the original sequence. In the first update of the original sequence, for each interval whose sequence values are continuously the second value B1 and whose length is less than or equal to the length threshold T, the sequence values of the interval are set to the first value B0. For example, in fig. 5, the 1 in the first cell and the 1 in the third cell of the original sequence are both replaced with 0 to obtain the first update sequence. In the second update, for each interval in the updated representation sequence whose sequence values are continuously the first value B0 and whose length is less than or equal to the length threshold T, the sequence values of the interval are set to the second value B1. For example, in fig. 5, the 0 in the seventh cell of the first update sequence is replaced with 1 to obtain the second update sequence.
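The denoising loop can be sketched as below. This is an illustrative reading of the description rather than the exact procedure: it assumes B0 = 0 and B1 = 1, a length threshold that starts at zero and grows by one per cycle, and an extra guard that stops the loop once the threshold exceeds the sequence length. All function names are hypothetical.

```python
from itertools import groupby

B0, B1 = 0, 1  # assumption: first value 0, second value 1


def runs(sequence):
    """Return (value, length) for each run of equal consecutive values."""
    return [(value, len(list(group))) for value, group in groupby(sequence)]


def count_runs(sequence, value):
    """Number of intervals whose values are continuously `value`."""
    return sum(1 for v, _ in runs(sequence) if v == value)


def flip_short_runs(sequence, value, new_value, max_length):
    """Replace every run of `value` no longer than `max_length`."""
    updated = []
    for v, length in runs(sequence):
        flip = v == value and length <= max_length
        updated.extend([new_value if flip else v] * length)
    return updated


def denoise(sequence, count_threshold=3):
    length_threshold = 0  # assumption: starts at zero, +1 per cycle
    while (count_runs(sequence, B1) > count_threshold
           and length_threshold <= len(sequence)):  # guard is an assumption
        sequence = flip_short_runs(sequence, B1, B0, length_threshold)
        sequence = flip_short_runs(sequence, B0, B1, length_threshold)
        length_threshold += 1
    return sequence


original = [1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1]
cleaned = denoise(original)
```

In this toy input, the isolated 1s at the first and third positions are cleared in the first update, and the single 0 separating two longer runs of 1 is filled in the second update, leaving two intervals of B1.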
In this embodiment, the webpage identification device may take the pixel groups corresponding to the positions whose value is the first value B0 in the finally obtained representation sequence as segmentation pixel groups, take the pixel points of these pixel groups as the image segmentation lines of the webpage screenshot in the first direction, and segment the webpage screenshot according to the image segmentation lines in the first direction to obtain a plurality of segmented pictures. For example, if the first direction is the row direction, the rows corresponding to the positions whose value is the first value B0 in the finally obtained representation sequence are used as segmentation lines to segment the webpage screenshot.
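One possible way to cut the screenshot along these lines is sketched below. It assumes B0 = 0 and B1 = 1, that each run of B1 rows becomes one segmented picture, and that the dividing-line rows themselves are discarded; the source does not specify the last point, and `split_rows` is a hypothetical name.

```python
from itertools import groupby

B0, B1 = 0, 1  # assumption: first value 0, second value 1


def split_rows(image, sequence):
    """Return one sub-image per run of B1 rows; B0 rows are dropped."""
    pictures = []
    start = 0
    for value, group in groupby(sequence):
        length = len(list(group))
        if value == B1:
            pictures.append(image[start:start + length])
        start += length
    return pictures


# Six-row toy image in which row i is filled with the value i.
image = [[i] * 3 for i in range(6)]
pictures = split_rows(image, [0, 1, 1, 0, 1, 0])
```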
In the embodiment of the present application, after a target segmented picture is selected from the obtained segmented pictures, the rows and columns of the target segmented picture may be interchanged (transposed) to facilitate image processing. The image segmentation line in the second direction is then obtained by the same process used to determine the image segmentation line in the first direction, and the target segmented picture is segmented based on the image segmentation line in the second direction to obtain a plurality of segmented pictures, thereby finally obtaining the at least one target detection picture.
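The row-column interchange lets the row-direction routine be reused for columns, as the following sketch suggests. It assumes nested-list images, B0 = 0 and B1 = 1, and the first pixel of each row as the reference; the helper names are hypothetical.

```python
B0, B1 = 0, 1  # assumption: first value 0, second value 1


def transpose(picture):
    """Interchange rows and columns of a nested-list image."""
    return [list(column) for column in zip(*picture)]


def row_sequence(picture, threshold):
    """Row-direction representation sequence."""
    return [
        B0 if sum(abs(p - row[0]) for p in row[1:]) < threshold else B1
        for row in picture
    ]


def column_sequence(picture, threshold):
    """Column-direction sequence, obtained by reusing the row routine."""
    return row_sequence(transpose(picture), threshold)


# Columns 0 and 2 are blank; columns 1 and 3 carry content.
picture = [
    [255, 0, 255, 7],
    [255, 9, 255, 3],
]
columns = column_sequence(picture, threshold=1)
```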
Step 203: determining the probability that each target detection picture in at least one target detection picture belongs to the target picture type, and determining the probability that the webpage to be identified belongs to the target webpage type according to the probability of each target picture.
Fig. 6 is a schematic diagram of obtaining the probabilities of target pictures for a webpage screenshot provided in the embodiment of the present application. After the screenshot is segmented, at least one target detection picture is obtained. Each target detection picture may be detected by a picture detection model to obtain its probability value for each of m picture types, from which the picture type of the target detection picture can be obtained; for example, the picture type with the highest probability value is the picture type to which the target detection picture belongs. In addition, the probability that each target detection picture belongs to the target picture type can be obtained. For example, the picture types may include normal, pornographic and illegal types, and the target picture type may be a pornographic picture or an illegal picture; the probability that a target detection picture is a pornographic picture or an illegal picture can then be detected by a pre-trained picture detection model.
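Selecting the picture type from the model output can be sketched as follows. The probability values and the dictionary layout are invented for illustration; the source does not specify the output format of the picture detection model.

```python
def picture_type(probabilities):
    """Return the picture type with the highest probability value."""
    return max(probabilities, key=probabilities.get)


# Hypothetical model output for one target detection picture (m = 3 types).
scores = {"normal": 0.004, "pornographic": 0.995, "illegal": 0.001}
predicted = picture_type(scores)
# Probability that the picture belongs to the target picture type.
target_probability = scores["pornographic"]
```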
Furthermore, the probability that the webpage to be identified belongs to the type of the target webpage can be determined according to the probability of each target detection picture and the preset weight value of each target picture.
Specifically, the probability values of the target detection pictures and the corresponding weight values may be weighted and summed to obtain the probability Tsite that the webpage to be identified belongs to the target webpage type. The target webpage type corresponds to the target picture type: for example, when the target picture type is a pornographic picture or an illegal picture, the target webpage type may be a pornographic webpage or an illegal webpage.
Illustratively, when 5 target detection pictures are obtained, the 5 pictures are detected by the picture detection model to obtain 5 probabilities, which are 0.995, 0.001, 0.002, 0.001 and 0.002 respectively, and the weight values corresponding to the 5 pictures are 0.9, 0.2, 0.4, 0.1 and 0.1 respectively. The probability Tsite that the webpage to be identified belongs to the target webpage type can then be calculated by weighted summation: Tsite = 0.995 × 0.9 + 0.001 × 0.2 + 0.002 × 0.4 + 0.001 × 0.1 + 0.002 × 0.1 = 0.8968. That is, the probability that the webpage to be identified belongs to the target webpage type is calculated to be 0.8968.
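The weighted summation in this example can be reproduced with a few lines. The function name `page_probability` is hypothetical; the probabilities and weights are the ones given above.

```python
def page_probability(probabilities, weights):
    """Weighted sum of per-picture probabilities."""
    if len(probabilities) != len(weights):
        raise ValueError("one weight is required per picture")
    return sum(p * w for p, w in zip(probabilities, weights))


t_site = page_probability(
    [0.995, 0.001, 0.002, 0.001, 0.002],
    [0.9, 0.2, 0.4, 0.1, 0.1],
)
print(round(t_site, 4))  # 0.8968
```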
Step 204: and when the probability that the webpage to be identified belongs to the target webpage type is greater than the set probability threshold value, determining the type of the webpage to be identified as the target webpage type.
In the embodiment of the application, a credibility threshold Tconfidence for the type of the webpage to be identified being the target webpage type is preset. When the calculated probability Tsite that the type of the webpage to be identified is the target webpage type is greater than or equal to the credibility threshold Tconfidence, the type of the webpage to be identified is determined to be the target webpage type.
In practical use, the credibility threshold Tconfidence can be set according to experience. For example, Tconfidence may be set to 0.8; then, when the calculated probability Tsite is greater than or equal to 0.8, the webpage identification device determines that the type of the webpage to be identified is the target webpage type. For example, when the calculated probability Tsite is 0.8968, 0.8968 ≥ 0.8, so it can be determined that the type of the webpage to be identified is the target webpage type.
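The final decision reduces to a single comparison, sketched here with the inclusive comparison used in this paragraph (step 204 itself is worded as "greater than"). The function name and the default of 0.8 are illustrative assumptions taken from the example value above.

```python
def is_target_page(t_site, t_confidence=0.8):
    """True when the page probability reaches the credibility threshold."""
    return t_site >= t_confidence


decision = is_target_page(0.8968)
```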
To sum up, the embodiment of the present application converts the website content into an image, simulating the process of human visual judgment, and then divides the image into one or more image areas for detection, for example at most 5 image areas. Compared with complicated image text recognition, recognition of hundreds of static pornographic pictures, or downloading video resources and sampling them frame by frame for pornographic picture recognition, the technical solution of the embodiment of the present application is simpler, saves more resources, and has a wider application range.
Based on the same inventive concept, an embodiment of the present application provides a web page identification apparatus, as shown in fig. 7, the apparatus is applied to a web page identification method, and the apparatus includes:
the acquiring unit 701 is used for acquiring a webpage screenshot of the webpage to be identified according to the uniform resource locator URL of the webpage to be identified;
a determining unit 702, configured to determine an image partition line of the web page screenshot according to a difference between pixel values of each row or each column of the web page screenshot;
the image segmentation unit 703 is configured to perform image segmentation on the web screenshot according to the image segmentation line to obtain at least one target detection picture;
the determining unit 702 is further configured to determine a probability that each target detection picture in the at least one target detection picture belongs to a target picture type, and determine a probability that the webpage to be identified belongs to a target webpage type according to the probability of each target picture; and when the probability that the webpage to be identified belongs to the target webpage type is greater than the set probability threshold value, determining the type of the webpage to be identified as the target webpage type.
Optionally, the obtaining unit 701 is configured to:
sending a URL access request to a webpage server corresponding to the URL according to the URL;
receiving a hypertext markup language (HTML) document corresponding to a Uniform Resource Locator (URL) returned by a web server;
and analyzing the HTML document, rendering according to the content obtained by analyzing the HTML document, and acquiring the webpage screenshot.
Optionally,
a determining unit 702, configured to determine an image partition line of the web page screenshot in a first direction according to a difference between pixel values in each group of pixel values in the first direction in the web page screenshot, where the first direction is a row direction or a column direction;
an image segmentation unit 703, configured to perform image segmentation on the screenshot based on image segmentation lines in the first direction to obtain a plurality of segmented pictures;
a determining unit 702, configured to select a target divided picture from the divided pictures, and determine an image dividing line of the target divided picture in a second direction, where the second direction is a column direction when the first direction is a row direction, or the second direction is a row direction when the first direction is a column direction;
an image segmentation unit 703 configured to perform image segmentation on the target segmented picture based on the image segmentation line in the second direction to obtain a plurality of segmented pictures;
a determining unit 702 is configured to determine, as at least one target detection picture, a plurality of divided pictures divided based on an image dividing line in a first direction and a plurality of divided pictures divided based on an image dividing line in a second direction.
Optionally, the determining unit 702 is configured to:
determining the difference degree between each pixel value in each group of pixel values in the first direction in the webpage screenshot; wherein, the first direction is a row direction, and each group of pixel values is a row of pixel values, or the first direction is a column direction, and each group of pixel values is a column of pixel values;
and when the difference degree of each group of pixel values is smaller than the set difference degree threshold value, determining the pixel points corresponding to each group of pixel values as image dividing lines of the webpage screenshot.
Optionally, the determining unit 702 is configured to:
determining the difference degree between each pixel value in each group of pixel values in the first direction in the webpage screenshot;
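obtaining a representation sequence of the webpage screenshot in the first direction according to the difference degree of each group of pixel values; wherein a group of pixel values corresponds to a position in the representation sequence, adjacent groups are adjacent in position in the representation sequence, when the difference degree of each group of pixel values is smaller than a set difference degree threshold value, the sequence value of the position corresponding to each group of pixel values is a first value, when the difference degree of each group of pixel values is greater than or equal to the difference degree threshold value, the sequence value of the position corresponding to each group of pixel values is a second value, and the first value is different from the second value;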
and determining an image segmentation line of the webpage screenshot in the first direction according to the representation sequence.
Optionally, the determining unit 702 is configured to:
for each group of pixel values, taking one pixel value in each group of pixel values as a reference pixel value, and acquiring the difference degree between the rest pixel values in each group of pixel values and the reference pixel value;
and acquiring the difference degree of each group of pixel values based on the difference degree between the rest pixel values in each group of pixel values and the reference pixel value.
Optionally, the determining unit 702 is configured to:
acquiring a second number of intervals representing that the sequence values in the sequence are continuously second values;
determining whether the second number is less than or equal to a set number threshold;
if the second number is determined to be larger than the set number threshold, executing the following loop process until the second number is smaller than or equal to the set number threshold, wherein each loop process comprises the following steps:
for each interval with the sequence value continuously being the second value, if the length of the interval is smaller than or equal to the length threshold, setting the sequence value of the interval as the first value to obtain a first updating sequence;
for each interval in which the sequence value in the updated representation sequence is continuously the first value, if the length of the interval is less than or equal to the length threshold, setting the sequence value of the interval as the second value to obtain a second updated sequence;
determining whether the second number in the second update sequence is less than or equal to a set number threshold;
if the second number in the second updating sequence is determined to be larger than the set number threshold, entering the next cycle process; or,
if it is determined that the second number in the second update sequence is less than or equal to the set number threshold, the loop ends.
Optionally, the length threshold is set according to the number of cycles.
Optionally, the determining unit 702 is configured to:
and determining the probability that the webpage to be identified belongs to the type of the target webpage according to the probability of each target picture and the weight value set for each target picture.
The apparatus may be configured to execute the methods shown in the embodiments shown in fig. 2 to fig. 6, and therefore, for functions and the like that can be realized by each functional module of the apparatus, reference may be made to the description of the embodiments shown in fig. 2 to fig. 6, which is not repeated here.
Referring to fig. 8, based on the same technical concept, an embodiment of the present application further provides a web page identification device 80, which may include a memory 801 and a processor 802.
The memory 801 is used for storing computer programs executed by the processor 802. The memory 801 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to use of the computer device, and the like. The processor 802 may be a Central Processing Unit (CPU), a digital processing unit, or the like. The specific connection medium between the memory 801 and the processor 802 is not limited in the embodiment of the present application. In the embodiment of the present application, the memory 801 and the processor 802 are connected by the bus 803 in fig. 8, the bus 803 is represented by a thick line in fig. 8, and the connection manner between other components is merely illustrative and is not limited. The bus 803 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
The memory 801 may be a volatile memory (volatile memory), such as a random-access memory (RAM); the memory 801 may also be a non-volatile memory (non-volatile memory) such as, but not limited to, a read-only memory (rom), a flash memory (flash memory), a Hard Disk Drive (HDD) or a solid-state drive (SSD), or the memory 801 may be any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 801 may be a combination of the above memories.
A processor 802 for executing the method performed by the apparatus in the embodiments shown in fig. 2-6 when calling the computer program stored in the memory 801.
In some possible embodiments, various aspects of the methods provided by the present application may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of the methods according to various exemplary embodiments of the present application described above in this specification when the program product is run on the computer device, for example, the computer device may perform the methods performed by the devices in the embodiments shown in fig. 2-6.
Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: a mobile storage device, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes. Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be essentially implemented or a part contributing to the prior art may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a ROM, a RAM, a magnetic or optical disk, or various other media that can store program code.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A method for identifying a web page, the method comprising:
acquiring a webpage screenshot of a webpage to be identified according to a Uniform Resource Locator (URL) of the webpage to be identified;
determining an image segmentation line of the webpage screenshot according to the difference degree between pixel values of each row or each column of the webpage screenshot, and performing image segmentation on the webpage screenshot according to the image segmentation line to obtain at least one target detection picture;
determining the probability that each target detection picture in the at least one target detection picture belongs to a target picture type, and determining the probability that the webpage to be identified belongs to a target webpage type according to the probability of each target picture;
and when the probability that the webpage to be identified belongs to the target webpage type is greater than a set probability threshold value, determining the type of the webpage to be identified as the target webpage type.
2. The method of claim 1, wherein the obtaining the screenshot of the web page to be identified according to the uniform resource locator URL of the web page to be identified comprises:
sending a URL access request to a webpage server corresponding to the URL according to the URL;
receiving a hypertext markup language (HTML) document corresponding to the URL returned by the webpage server;
and analyzing the HTML document, rendering according to the content obtained by analyzing the HTML document, and acquiring the webpage screenshot.
3. The method of claim 1, wherein determining an image segmentation line of the screenshot according to a degree of difference between pixel values of each row or each column of the screenshot, and performing image segmentation on the screenshot according to the image segmentation line to obtain at least one target detection picture, comprises:
determining an image segmentation line of the webpage screenshot in a first direction according to the difference degree between pixel values in each group of pixel values in the webpage screenshot in the first direction, wherein the first direction is a row direction or a column direction;
carrying out image segmentation on the webpage screenshot based on the image segmentation lines in the first direction to obtain a plurality of segmentation pictures;
selecting a target segmentation picture from the segmentation pictures, and determining an image segmentation line of the target segmentation picture in a second direction, wherein the second direction is a column direction when the first direction is a row direction, or the second direction is a row direction when the first direction is a column direction;
performing image segmentation on the target segmentation picture based on the image segmentation lines in the second direction to obtain a plurality of segmentation pictures;
and determining the plurality of segmentation pictures obtained based on the image segmentation lines in the first direction and the plurality of segmentation pictures obtained based on the image segmentation lines in the second direction as the at least one target detection picture.
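The two-pass segmentation of claim 3 can be sketched as follows, assuming a grayscale screenshot stored as a list of rows of integer pixel values. Several choices here are assumptions rather than anything the claim fixes: the difference degree is taken as the spread (max minus min) of a row or column, the first direction is the row direction, and the target segmentation picture is the tallest one. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]  # grayscale screenshot: rows of pixel values


def is_uniform(values: Sequence[int], threshold: int) -> bool:
    """A row or column is a candidate segmentation line when its spread is small."""
    return max(values) - min(values) < threshold


def content_runs(flags: List[bool]) -> List[Tuple[int, int]]:
    """Return half-open (start, end) ranges of consecutive non-uniform positions."""
    runs, start = [], None
    for i, uniform in enumerate(flags):
        if not uniform and start is None:
            start = i
        elif uniform and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(flags)))
    return runs


def split_rows(image: Image, threshold: int) -> List[Image]:
    """Cut the image along uniform rows (first direction)."""
    flags = [is_uniform(row, threshold) for row in image]
    return [image[a:b] for a, b in content_runs(flags)]


def split_columns(image: Image, threshold: int) -> List[Image]:
    """Cut the image along uniform columns (second direction)."""
    flags = [is_uniform(col, threshold) for col in zip(*image)]
    return [[row[a:b] for row in image] for a, b in content_runs(flags)]


def segment(image: Image, threshold: int = 10) -> List[Image]:
    """Return the row-wise pieces plus the column-wise pieces of one target piece."""
    row_pieces = split_rows(image, threshold)
    if not row_pieces:
        return []
    target = max(row_pieces, key=len)  # assumption: pick the tallest piece
    return row_pieces + split_columns(target, threshold)
```

As in the claim, the result contains both the pictures cut in the first direction and the pictures cut from the selected target picture in the second direction.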
4. The method of claim 3, wherein determining the image segmentation line of the webpage screenshot in the first direction according to the difference degree between pixel values in each group of pixel values in the first direction in the webpage screenshot comprises:
determining the difference degree between the pixel values in each group of pixel values in the first direction in the webpage screenshot; wherein the first direction is a row direction and each group of pixel values is a row of pixel values, or the first direction is a column direction and each group of pixel values is a column of pixel values;
and when the difference degree of a group of pixel values is smaller than a set difference degree threshold, determining the pixel points corresponding to that group of pixel values as an image segmentation line of the webpage screenshot.
5. The method of claim 3, wherein determining the image segmentation line of the webpage screenshot in the first direction according to the difference degree between pixel values in each group of pixel values in the first direction in the webpage screenshot comprises:
determining the difference degree between the pixel values in each group of pixel values in the first direction in the webpage screenshot;
obtaining a representation sequence of the webpage screenshot in the first direction according to the difference degree of each group of pixel values; wherein each group of pixel values corresponds to one position in the representation sequence, and adjacent groups occupy adjacent positions in the representation sequence; when the difference degree of a group of pixel values is smaller than a set difference degree threshold, the sequence value at the position corresponding to that group is a first value; when the difference degree of a group of pixel values is greater than or equal to the difference degree threshold, the sequence value at the position corresponding to that group is a second value; and the first value is different from the second value;
and determining an image segmentation line of the webpage screenshot in the first direction according to the representation sequence.
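The representation sequence of claim 5 can be sketched as a list with one entry per row or column. The choice of 0 for the first value and 1 for the second value, the use of the spread (max minus min) as the difference degree, and the name `representation_sequence` are assumptions made for illustration.

```python
from typing import List, Sequence


def representation_sequence(
    groups: Sequence[Sequence[int]],
    threshold: int,
    first_value: int = 0,
    second_value: int = 1,
) -> List[int]:
    """Map each group (row or column) to first_value or second_value.

    A group whose difference degree is below the threshold gets first_value;
    a group at or above the threshold gets second_value.
    Assumption: the difference degree is max(group) - min(group).
    """
    return [
        first_value if max(group) - min(group) < threshold else second_value
        for group in groups
    ]
```

Runs of the first value then mark candidate segmentation lines, and runs of the second value mark content regions.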
6. The method of claim 4 or 5, wherein determining the difference degree between pixel values in each group of pixel values in the first direction in the webpage screenshot comprises:
for each group of pixel values, taking one pixel value in the group as a reference pixel value, and acquiring the difference degree between each of the remaining pixel values in the group and the reference pixel value;
and acquiring the difference degree of the group of pixel values based on the difference degrees between the remaining pixel values in the group and the reference pixel value.
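The reference-pixel computation of claim 6 can be sketched as follows. The claim does not specify which pixel serves as the reference, how a single difference is measured, or how the differences are combined. Here the first pixel is the reference, the difference is the absolute difference of grayscale values, and the group's difference degree is the largest such difference. All three are assumptions, and the name `group_difference` is hypothetical.

```python
from typing import Sequence


def group_difference(pixels: Sequence[int]) -> int:
    """Difference degree of one row or column of grayscale pixel values.

    Assumptions: the first pixel is the reference, each difference is an
    absolute difference, and the group's degree is the maximum difference.
    """
    if not pixels:
        return 0
    reference = pixels[0]
    return max(abs(value - reference) for value in pixels)
```

Other aggregation rules, such as the mean or the sum of the differences, would fit the wording of the claim equally well.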
7. The method of claim 5, wherein determining an image segmentation line of the webpage screenshot in the first direction according to the representation sequence comprises:
acquiring a second number, the second number being the number of intervals in the representation sequence whose sequence values are consecutively the second value;
determining whether the second number is less than or equal to a set number threshold;
if the second number is determined to be greater than the set number threshold, executing the following loop process until the second number is less than or equal to the set number threshold, wherein each loop iteration comprises the following steps:
for each interval whose sequence values are consecutively the second value, if the length of the interval is less than or equal to a length threshold, setting the sequence values of the interval to the first value to obtain a first updated sequence;
for each interval in the first updated sequence whose sequence values are consecutively the first value, if the length of the interval is less than or equal to the length threshold, setting the sequence values of the interval to the second value to obtain a second updated sequence;
determining whether the second number in the second updated sequence is less than or equal to the set number threshold;
if the second number in the second updated sequence is determined to be greater than the set number threshold, entering the next loop iteration; or,
if the second number in the second updated sequence is determined to be less than or equal to the set number threshold, ending the loop.
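The loop of claim 7 can be sketched as follows, with 0 as the first value and 1 as the second value. The claim does not state how the length threshold changes between iterations. Increasing it by one per iteration, and stopping once it exceeds the sequence length, are assumptions added here so that the loop is guaranteed to terminate. All function names are hypothetical.

```python
from itertools import groupby
from typing import List, Sequence, Tuple


def runs(seq: Sequence[int]) -> List[Tuple[int, int]]:
    """Return (value, length) for each maximal run of equal values."""
    return [(value, len(list(group))) for value, group in groupby(seq)]


def count_runs(seq: Sequence[int], value: int) -> int:
    """Number of intervals whose values are consecutively `value`."""
    return sum(1 for v, _ in runs(seq) if v == value)


def flip_short_runs(seq: Sequence[int], value: int, new_value: int, max_len: int) -> List[int]:
    """Replace every run of `value` no longer than max_len with new_value."""
    out: List[int] = []
    for v, length in runs(seq):
        out.extend([new_value if v == value and length <= max_len else v] * length)
    return out


def smooth(seq: Sequence[int], number_threshold: int, length_threshold: int = 1,
           first: int = 0, second: int = 1) -> List[int]:
    """Merge short intervals until at most number_threshold runs of `second` remain."""
    current = list(seq)
    while (count_runs(current, second) > number_threshold
           and length_threshold <= len(current)):
        current = flip_short_runs(current, second, first, length_threshold)  # first updated sequence
        current = flip_short_runs(current, first, second, length_threshold)  # second updated sequence
        length_threshold += 1  # assumption: widen the length threshold each iteration
    return current
```

For example, with a number threshold of 2, the isolated leading 1 in `[1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1]` is absorbed into the neighbouring run of 0s, leaving two content intervals.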
8. An apparatus for identifying a web page, the apparatus comprising:
an acquisition unit, configured to acquire a webpage screenshot of a webpage to be identified according to a uniform resource locator (URL) of the webpage to be identified, and to determine an image segmentation line of the webpage screenshot according to the difference degree between pixel values of each row or each column of the webpage screenshot;
a determining unit, configured to determine the probability that each target detection picture in the at least one target detection picture belongs to a target picture type, and to determine, according to the probability of each target detection picture, the probability that the webpage to be identified belongs to a target webpage type; and, when the probability that the webpage to be identified belongs to the target webpage type is greater than a set probability threshold, to determine the type of the webpage to be identified as the target webpage type;
and an image segmentation unit, configured to perform image segmentation on the webpage screenshot according to the image segmentation line to obtain at least one target detection picture.
9. An apparatus for web page identification, the apparatus comprising:
a memory for storing program instructions and a page screenshot of the access result page of the webpage to be monitored;
a processor for calling program instructions stored in said memory and for executing the steps comprised in the method of any one of claims 1 to 7 in accordance with the obtained program instructions.
10. A storage medium storing computer-executable instructions for causing a computer to perform the steps comprised in the method of any one of claims 1 to 7.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010882588.3A CN112036412A (en) 2020-08-28 2020-08-28 Webpage identification method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010882588.3A CN112036412A (en) 2020-08-28 2020-08-28 Webpage identification method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112036412A true CN112036412A (en) 2020-12-04

Family

ID=73587280

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010882588.3A Pending CN112036412A (en) 2020-08-28 2020-08-28 Webpage identification method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112036412A (en)

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080239354A1 (en) * 2007-03-28 2008-10-02 Usui Daisuke Image processing method, image processing apparatus, image forming apparatus, and recording medium
CN101872347A (en) * 2009-04-22 2010-10-27 富士通株式会社 Method and device for judging type of webpage
CN101937438A (en) * 2009-06-30 2011-01-05 富士通株式会社 Method and device for extracting webpage content
CN101968813A (en) * 2010-10-25 2011-02-09 华北电力大学 Method for detecting counterfeit webpage
CN103455814A (en) * 2012-05-31 2013-12-18 佳能株式会社 Text line segmenting method and text line segmenting system for document images
US20140149855A1 (en) * 2010-10-21 2014-05-29 Uc Mobile Limited Character Segmenting Method and Apparatus for Web Page Pictures
CN106326451A (en) * 2016-08-26 2017-01-11 武汉大学 Method for judging webpage sensing information block based on visual feature extraction
CN108921184A (en) * 2018-04-18 2018-11-30 中国科学院信息工程研究所 A kind of general type of webpage determination method
CN109284613A (en) * 2018-09-30 2019-01-29 北京神州绿盟信息安全科技股份有限公司 Label detection and counterfeit site detecting method, device, equipment and storage medium
CN110442807A (en) * 2019-08-05 2019-11-12 腾讯科技(深圳)有限公司 A kind of webpage type identification method, device, server and storage medium
CN110619075A (en) * 2018-06-04 2019-12-27 阿里巴巴集团控股有限公司 Webpage identification method and equipment
CN110826488A (en) * 2019-11-06 2020-02-21 苏州思必驰信息科技有限公司 Image identification method and device for electronic document and storage equipment
CN110969166A (en) * 2019-12-04 2020-04-07 国网智能科技股份有限公司 Small target identification method and system in inspection scene
CN111369514A (en) * 2020-02-28 2020-07-03 京东方科技集团股份有限公司 Image segmentation mode detection method and device and display device
CN111382383A (en) * 2018-12-28 2020-07-07 广州市百果园信息技术有限公司 Method, device, medium and computer equipment for determining sensitive type of webpage content

Similar Documents

Publication Publication Date Title
CN111898696B (en) Pseudo tag and tag prediction model generation method, device, medium and equipment
CN108595583B (en) Dynamic graph page data crawling method, device, terminal and storage medium
US20130145255A1 (en) Systems and methods for filtering web page contents
US20090085921A1 (en) Populate Web-Based Content Based on Space Availability
CN109240912B (en) Webpage application performance evaluation method and terminal based on big data analysis
US10878024B2 (en) Dynamic thumbnails
US20170199850A1 (en) Method and system to decrease page load time by leveraging network latency
US9560160B1 (en) Prioritization of the delivery of different portions of an image file
JP2013084259A (en) Gradual visual comparison of web browser screen
CN111311480B (en) Image fusion method and device
CN111199157A (en) Text data processing method and device
CN116452810A (en) Multi-level semantic segmentation method and device, electronic equipment and storage medium
CN113810375B (en) Webshell detection method, device and equipment and readable storage medium
CN112784189A (en) Method and device for identifying page image
CN109684844B (en) Webshell detection method and device, computing equipment and computer-readable storage medium
CN111898544A (en) Character and image matching method, device and equipment and computer storage medium
US8867837B2 (en) Detecting separator lines in a web page
CN112115266A (en) Malicious website classification method and device, computer equipment and readable storage medium
CN116774973A (en) Data rendering method, device, computer equipment and storage medium
US11200284B1 (en) Optimization of feature embeddings for deep learning models
CN112036412A (en) Webpage identification method, device, equipment and storage medium
CN113642642B (en) Control identification method and device
CN115393756A (en) Visual image-based watermark identification method, device, equipment and medium
US11887356B2 (en) System, method and apparatus for training a machine learning model
CN113887375A (en) Text recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination