CN117763257A - Webpage sensitive content marking system based on OCR - Google Patents

Webpage sensitive content marking system based on OCR

Info

Publication number
CN117763257A
Authority
CN
China
Prior art keywords
ocr
sensitive
sensitive content
webpage
marking system
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311710386.0A
Other languages
Chinese (zh)
Inventor
张海林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dongxin Wangan Shenzhen Technology Co ltd
Original Assignee
Dongxin Wangan Shenzhen Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dongxin Wangan Shenzhen Technology Co ltd filed Critical Dongxin Wangan Shenzhen Technology Co ltd
Priority to CN202311710386.0A priority Critical patent/CN117763257A/en
Publication of CN117763257A publication Critical patent/CN117763257A/en
Pending legal-status Critical Current

Landscapes

  • Character Input (AREA)

Abstract

The invention discloses an OCR-based webpage sensitive content marking system, relates to sensitive content detection technology, and provides a scheme that addresses problems in the prior art such as low accuracy. The system mainly performs the following steps: webpage access and data acquisition; sensitive word pre-filtering; image preprocessing and segmentation; adaptive edge detection; OCR recognition; sensitive content filtering and integration; marking and storage. Its advantages are high recognition accuracy, high processing efficiency, efficient resource utilization, complete content coverage, and good flexibility and adaptability.

Description

Webpage sensitive content marking system based on OCR
Technical Field
The invention relates to a sensitive content detection technology, in particular to a webpage sensitive content marking system based on OCR.
Background
In the current internet environment, identification and tagging of web page sensitive content is an important part of network security and content administration. The prior art has focused mainly on the analysis and filtering of text content, but has faced significant challenges in processing sensitive information embedded in images. In particular, with the advent of social media and online platforms, more and more sensitive content is hidden in images or dynamically generated content that is difficult to efficiently identify and process by conventional text analysis methods.
In addition, existing OCR (optical character recognition) systems typically have to process an entire page or image when handling webpage content. This is not only inefficient but also prone to a large number of false positives, especially with complex webpage layouts and diverse visual elements. The approach demands substantial computing resources and works poorly on dynamically loaded content or hidden sensitive information. The prior art therefore suffers from low accuracy, high resource consumption, and an inability to handle dynamic content effectively when processing sensitive information in complex webpage environments.
It follows that the prior art has significant drawbacks in marking sensitive webpage content, especially when processing images and dynamic content. It is therefore important to develop a system that can efficiently and accurately identify and mark sensitive information in webpages, in particular sensitive content that is hidden or dynamically generated.
Disclosure of Invention
The invention aims to provide an OCR-based webpage sensitive content marking system so as to solve the problems in the prior art.
The OCR-based webpage sensitive content marking system comprises a full sensitive word stock. When identifying sensitive content in a webpage, the system performs the following steps:
Step S1, webpage access and data acquisition: revealing all hidden content of the page, and acquiring the source code and a complete screenshot of the whole webpage;
Step S2, sensitive word pre-filtering: performing preliminary filtering on the source code using the full sensitive word stock;
Step S3, image preprocessing and segmentation: converting the webpage screenshot into a black-and-white image; dividing the image on a grid of m pixels, and extracting the pixel values along the x axis or y axis on which each dividing line lies; wherein m is a positive integer;
Step S4, adaptive edge detection: judging whether the current dividing line crosses text according to the segment-wise variance of the pixel values on the dividing line; when it does, translating the dividing line step by step until it no longer crosses text or the maximum expansion limit is reached;
Step S5, OCR recognition: calling an OCR program to perform text recognition on each pixel block of the webpage screenshot, and obtaining the text of each pixel block and the corresponding position information;
Step S6, sensitive content filtering and integration: re-merging the text and position information returned by the OCR program, performing secondary filtering on the text using the full sensitive word stock, and identifying and recording the matched sensitive words and their positions on the original webpage screenshot;
Step S7, marking and storage: marking all sensitive words at their corresponding positions on the original webpage screenshot, storing the marked image according to preset requirements, and returning the binary data of the marked image.
In step S1, all hidden content on the page is revealed by injecting an un-hiding script.
In step S2, if after the preliminary filtering no sensitive word has been found and no picture link has been detected, content recognition ends and a detection result stating that the current webpage contains no sensitive content is returned; otherwise, the next step is entered.
Picture links are detected using regular expressions.
In step S3, m = 512.
In step S4, if the dividing line still crosses text when the maximum expansion limit is reached, the position crossing the fewest text segments is taken as the new dividing line.
In step S4, the dividing line is translated in steps of 2 pixels.
In step S4, the maximum expansion limit is 20 pixels.
In step S5, the OCR program is invoked in parallel.
Text recognition is performed at the character level.
The webpage sensitive content marking system based on OCR has the following advantages.
1. Higher recognition accuracy: adaptive edge detection keeps dividing lines from cutting through characters, a common problem in conventional OCR processing, so text in the image is recognized more accurately. The improvement is most noticeable on webpage images with complex or dense text.
2. Higher processing efficiency: the screenshot is divided into relatively large blocks that are then processed by OCR in parallel, which speeds up text recognition. The pre-filtering stage quickly rules out webpages with neither sensitive words nor picture links, so no processing is wasted on them and overall efficiency improves further.
3. Better resource utilization: adaptive edge detection reduces the need to run OCR on the whole image at once; only specific areas are processed when necessary. This optimizes the use of computing resources and lowers system requirements, making the system particularly suitable for resource-constrained environments.
4. Complete content coverage: the system handles hidden content as well as text inside images, giving a more comprehensive analysis of webpage content. It processes not only plainly visible text but also hidden or image-embedded sensitive information that conventional methods find difficult to detect.
5. Good flexibility and adaptability: the method takes the diversity and complexity of webpages into account and adapts to a variety of layouts and content formats. Adaptive edge detection and segmentation keep the system efficient and accurate across different types of webpages.
Drawings
FIG. 1 is a schematic workflow diagram of a web page sensitive content tagging system according to the present invention.
FIG. 2 is a schematic diagram of how the webpage sensitive content marking system according to the present invention divides a screenshot.
FIG. 3 is a schematic diagram of how the webpage sensitive content marking system according to the present invention segments a dividing line and extracts its pixel values.
Detailed Description
When performing OCR recognition, the OCR-based webpage sensitive content marking system of the present invention executes the following specific steps, as shown in FIG. 1. The full sensitive word stock is compiled from a large number of illegal advertisement webpages. It is used in the present system to screen for sensitive content in the target HTML source code and in the OCR results.
Step P01: A target URL is entered and a browser is called to access it; JavaScript injection is used to make content hidden from ordinary users visible. After the page has loaded, a screenshot of the complete webpage is captured and kept in binary data format as parameter SC, and the webpage source code is acquired as parameter RH.
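A minimal sketch of step P01 follows. It assumes the Playwright library for Python as the browser driver; the patent does not name a browser automation tool. The script `UNHIDE_JS` and the function `fetch_page` are hypothetical names, and the un-hiding rules are only one plausible reading of "JavaScript injection".

```python
# Hypothetical un-hiding script: the source only says that hidden content is
# made visible by JavaScript injection; the concrete rules are an assumption.
UNHIDE_JS = """
() => {
  const skip = new Set(['SCRIPT', 'STYLE', 'TEMPLATE', 'NOSCRIPT']);
  for (const el of document.body.querySelectorAll('*')) {
    if (skip.has(el.tagName)) continue;
    const style = window.getComputedStyle(el);
    if (style.display === 'none') {
      el.style.setProperty('display', 'block', 'important');
    }
    if (style.visibility === 'hidden') {
      el.style.setProperty('visibility', 'visible', 'important');
    }
    if (el.hasAttribute('hidden')) el.removeAttribute('hidden');
  }
}
"""


def fetch_page(url: str) -> tuple[bytes, str]:
    """Return (SC, RH): full-page screenshot bytes and the page source."""
    # Imported lazily so this module loads even where Playwright is absent.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for the page to load
        page.evaluate(UNHIDE_JS)                  # reveal hidden elements
        sc = page.screenshot(full_page=True)      # binary image data (SC)
        rh = page.content()                       # HTML source (RH)
        browser.close()
    return sc, rh
```

Script and style elements are skipped so that un-hiding does not render their raw text into the screenshot.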
Step P02: The full sensitive word stock is loaded and the HTML source code is filtered against it to obtain the list RW_list of sensitive words actually present. A regular expression is used to search for all picture links in the HTML source code, producing the picture link list IMG_list.
Step P03: If the sensitive word list RW_list and the picture link list IMG_list are both empty, the page contains no sensitive content to be marked. The flow goes directly to step P10, ends, and returns the message "no sensitive content found". Otherwise, the flow continues with the next step.
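Steps P02 and P03 can be sketched as below. The regular expression `IMG_PATTERN` and the functions `prefilter` and `needs_ocr` are hypothetical; the source states only that picture links are found with a regular expression and does not give the pattern. Plain substring matching is assumed for the word filter.

```python
import re

# Hypothetical pattern: the src of <img> tags, or a bare image URL.
IMG_PATTERN = re.compile(
    r"""<img[^>]+src=["']([^"']+)["']|(https?://[^\s"'<>]+\.(?:png|jpe?g|gif|webp|bmp))""",
    re.IGNORECASE,
)


def prefilter(html: str, sensitive_words: list[str]) -> tuple[list[str], list[str]]:
    """Return (rw_list, img_list) for the page source RH."""
    # Sensitive words that actually occur in the source (RW_list).
    rw_list = [w for w in sensitive_words if w in html]
    # Picture links in order of appearance, without duplicates (IMG_list).
    img_list: list[str] = []
    for match in IMG_PATTERN.finditer(html):
        link = match.group(1) or match.group(2)
        if link not in img_list:
            img_list.append(link)
    return rw_list, img_list


def needs_ocr(rw_list: list[str], img_list: list[str]) -> bool:
    """Step P03: the flow stops early only when both lists are empty."""
    return bool(rw_list or img_list)
```

For example, `prefilter('<p>cheap pills</p><img src="/ad.png">', ["cheap pills"])` yields one word and one link, so `needs_ocr` is true and the flow continues to step P04.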
Step P04: The original screenshot SC is copied and converted to a black-and-white image (denoted sc_copy), and its width and height, sc_width and sc_height, are obtained. If both sc_width and sc_height are less than or equal to 1024 pixels, the flow goes directly to step P06, because the image is small enough for efficient OCR recognition without segmentation. If either side of the image is larger than 1024 pixels, the image is divided in units of 512 pixels into a number of 512x512-pixel image blocks. Where the strip at the far right or far bottom would be less than 512 pixels, the corresponding dividing line is removed. As shown in FIG. 2, assuming the width and height of the screenshot sc_copy are 1776 pixels and 1836 pixels respectively, dividing in units of 512 pixels requires 3 dividing lines each across the width and the height, namely y1, y2, y3, x1, x2 and x3; since the areas to the right of y3 and below x3 are less than 512 pixels, the dividing lines y3 and x3 are removed.
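The grid computation in step P04 can be sketched as follows, using the 1776 x 1836 example from the text. The function names `grid_lines` and `plan_grid` are hypothetical; the constants 512 and 1024 come from the description.

```python
def grid_lines(length: int, m: int = 512) -> list[int]:
    """Dividing-line positions along one side of the screenshot."""
    lines = list(range(m, length, m))
    # Remove the last line when the strip beyond it is narrower than m pixels.
    if lines and length - lines[-1] < m:
        lines.pop()
    return lines


def plan_grid(width: int, height: int, m: int = 512, small: int = 1024):
    """Return (vertical_lines, horizontal_lines); both empty for small images."""
    if width <= small and height <= small:
        return [], []  # small enough for OCR without segmentation
    return grid_lines(width, m), grid_lines(height, m)


print(plan_grid(1776, 1836))
```

For the example screenshot, two lines remain in each direction, at 512 and 1024 pixels. The last block in each row and column is therefore larger than 512 pixels, since it absorbs the leftover strip.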
Step P05: Using parallel processing, the variance is calculated over each 20-pixel segment of every dividing line to judge whether text is present. If the variance is below the set threshold C, the segment is considered to contain no text; otherwise text may be present and the segment is recorded as a text segment. If no text segment is detected along the whole dividing line, its position is kept unchanged. If a text segment is detected, the dividing line is translated: to the right along the x axis, or downward along the y axis. The translation step is 2 pixels, and the variance calculation is repeated after each step. This continues until the dividing line has moved 20 pixels or a position with no text segment is found. If no such position is found, the position with the fewest text segments is selected as the new dividing line.
As shown in FIG. 3, the variance along a dividing line is calculated at intervals of 20 pixels; a segment whose variance is above the threshold C is considered a text segment, and the more such segments a dividing line has, the more text it is likely to cut. The segment variance is calculated as:

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

where $x_i$ is the i-th pixel value in the pixel array and $\mu$ is the mean of the pixel values in the segment, calculated as:

$$\mu = \frac{1}{N} \sum_{i=1}^{N} x_i$$

and $N$ is the total number of pixels in the pixel array.
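The dividing-line adjustment of step P05 can be sketched for a vertical line as follows. This is an illustration under stated assumptions, not the patented implementation: the image is a plain list of rows of grey values, the population variance from the standard library stands in for the segment variance, and the names `count_text_segments` and `adjust_vertical_line` are hypothetical. The threshold C is not given in the source, so it is a parameter here.

```python
from statistics import pvariance


def count_text_segments(line_pixels, threshold, seg=20):
    """Count seg-pixel segments whose variance reaches the threshold C."""
    count = 0
    for start in range(0, len(line_pixels), seg):
        chunk = line_pixels[start:start + seg]
        if pvariance(chunk) >= threshold:  # not below C: may contain text
            count += 1
    return count


def adjust_vertical_line(image, x, threshold, step=2, max_shift=20):
    """Shift a vertical dividing line right until it crosses no text segment.

    Falls back to the candidate with the fewest text segments when every
    position within max_shift pixels still crosses text.
    """
    width = len(image[0])
    best_x, best_count = x, None
    for shift in range(0, max_shift + 1, step):
        cand = x + shift
        if cand >= width:
            break
        column = [row[cand] for row in image]  # pixel values on the line
        n = count_text_segments(column, threshold)
        if n == 0:
            return cand  # clean position found
        if best_count is None or n < best_count:
            best_x, best_count = cand, n
    return best_x
```

A horizontal dividing line would be handled the same way by reading a row of the image instead of a column and shifting downward.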
Step P06: The OCR program is called in parallel to perform character-level text recognition on the image blocks produced by the edge adjustment and segmentation, obtaining the characters and their position information within each image block.
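A sketch of the parallel OCR call in step P06 follows. The OCR engine is not named in the source, so it is passed in as a hypothetical callable `ocr_fn` that takes one cropped block; `block_boxes` and `ocr_blocks` are likewise illustrative names, and a thread pool is only one possible way to run the calls in parallel.

```python
from concurrent.futures import ThreadPoolExecutor


def block_boxes(width, height, v_lines, h_lines):
    """(left, top, right, bottom) boxes cut out by the adjusted dividing lines."""
    xs = [0, *v_lines, width]
    ys = [0, *h_lines, height]
    return [
        (xs[i], ys[j], xs[i + 1], ys[j + 1])
        for j in range(len(ys) - 1)
        for i in range(len(xs) - 1)
    ]


def ocr_blocks(image, boxes, ocr_fn, workers=4):
    """Run ocr_fn on every block in parallel; results keep the order of boxes."""
    def crop(box):
        left, top, right, bottom = box
        return [row[left:right] for row in image[top:bottom]]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda box: ocr_fn(crop(box)), boxes))
```

Keeping the results in the same order as `boxes` matters for the next step, where each block's local coordinates are shifted back by the block's top-left corner.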
Step P07: Based on the positions of the new dividing lines from step P05, the character information and position information returned by OCR are merged back into the coordinates of the original screenshot. The recognized text is then filtered using the full sensitive word stock, and all matching sensitive words OW_list and their specific positions OWP_list on the original screenshot are recorded.
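The merging and secondary filtering of step P07 can be sketched as below. Several details are assumptions made for illustration: each OCR result is taken to be a list of `(char, x, y, w, h)` tuples with exactly one character per entry and coordinates local to its block, characters are grouped into text rows by a vertical tolerance `row_tol`, and matching is plain substring search. The function name `merge_and_match` is hypothetical.

```python
def merge_and_match(block_results, boxes, words, row_tol=5):
    """Return (ow_list, owp_list): matched words and their bounding boxes."""
    # 1. Shift block-local coordinates back onto the original screenshot.
    chars = []
    for (left, top, _, _), result in zip(boxes, block_results):
        for ch, x, y, w, h in result:
            chars.append((ch, x + left, y + top, w, h))

    # 2. Group characters into text rows, then order each row left to right.
    chars.sort(key=lambda c: (c[2], c[1]))
    rows = []
    for c in chars:
        if rows and abs(c[2] - rows[-1][0][2]) <= row_tol:
            rows[-1].append(c)
        else:
            rows.append([c])

    # 3. Search every row for sensitive words and record their positions.
    ow_list, owp_list = [], []
    for row in rows:
        row.sort(key=lambda c: c[1])
        text = "".join(c[0] for c in row)
        for word in words:
            start = text.find(word)
            while start != -1:
                hit = row[start:start + len(word)]
                ow_list.append(word)
                owp_list.append((
                    min(c[1] for c in hit),
                    min(c[2] for c in hit),
                    max(c[1] + c[3] for c in hit),
                    max(c[2] + c[4] for c in hit),
                ))
                start = text.find(word, start + 1)
    return ow_list, owp_list
```

Because matching happens after the merge, a word whose characters fall into two neighbouring blocks is still found as one hit.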
Step P08: If OW_list is empty, meaning that no sensitive content was identified or that identification did not succeed, the flow goes to step P10 with error information and ends. Otherwise, the flow goes to step P09.
Step P09: All identified sensitive content positions and the corresponding sensitive words are marked on the original screenshot. The processed image is stored at a designated location, and the binary data of the marked screenshot is returned together with a task-success flag.
Step P10: The error information or the task-success flag is collated, the final result is returned, and the task ends.
The main features of the OCR-based webpage sensitive content marking system of the invention are:
1. Adaptive image segmentation mechanism: the mechanism automatically determines from the image size whether segmentation is needed and determines the segmentation scheme. This is particularly important for large images, ensuring the efficiency and accuracy of OCR recognition.
2. Variance-based text segment identification and dividing line optimization: by calculating the pixel variance of individual segments on a dividing line, the position of the dividing line can be adjusted dynamically, significantly reducing the likelihood of cutting through text. This plays a central role in improving the recognition accuracy of Chinese content in webpage screenshots.
3. Accurate integration of OCR results and position information: the text recognized by OCR is combined with its exact position in the original image and finely filtered against the sensitive word stock, improving the accuracy of sensitive content recognition.
4. Efficient visual marking of sensitive content: sensitive content is marked directly on the original screenshot, which makes the information easier to recognize and increases the practical usefulness of the system.
It will be apparent to those skilled in the art from this disclosure that various other changes and modifications can be made which are within the scope of the invention as defined in the appended claims.

Claims (10)

1. An OCR-based webpage sensitive content marking system, comprising a full sensitive word stock;
characterized in that, when identifying sensitive content in a webpage, the webpage sensitive content marking system executes the following steps:
Step S1, webpage access and data acquisition: revealing all hidden content of the page, and acquiring the source code and a complete screenshot of the whole webpage;
Step S2, sensitive word pre-filtering: performing preliminary filtering on the source code using the full sensitive word stock;
Step S3, image preprocessing and segmentation: converting the webpage screenshot into a black-and-white image; dividing the image on a grid of m pixels, and extracting the pixel values along the x axis or y axis on which each dividing line lies; wherein m is a positive integer;
Step S4, adaptive edge detection: judging whether the current dividing line crosses text according to the segment-wise variance of the pixel values on the dividing line; when it does, translating the dividing line step by step until it no longer crosses text or the maximum expansion limit is reached;
Step S5, OCR recognition: calling an OCR program to perform text recognition on each pixel block of the webpage screenshot, and obtaining the text of each pixel block and the corresponding position information;
Step S6, sensitive content filtering and integration: re-merging the text and position information returned by the OCR program, performing secondary filtering on the text using the full sensitive word stock, and identifying and recording the matched sensitive words and their positions on the original webpage screenshot;
Step S7, marking and storage: marking all sensitive words at their corresponding positions on the original webpage screenshot, storing the marked image according to preset requirements, and returning the binary data of the marked image.
2. The OCR-based webpage sensitive content marking system according to claim 1, wherein in step S1, all hidden content on the page is revealed by injecting an un-hiding script.
3. The OCR-based webpage sensitive content marking system according to claim 1, wherein in step S2, if after the preliminary filtering no sensitive word has been found and no picture link has been detected, content recognition ends and a detection result stating that the current webpage contains no sensitive content is returned; otherwise, the next step is entered.
4. The OCR-based webpage sensitive content marking system according to claim 3, wherein picture links are detected using regular expressions.
5. The OCR-based webpage sensitive content marking system according to claim 1, wherein in step S3, m = 512.
6. The OCR-based webpage sensitive content marking system according to claim 1, wherein in step S4, if the dividing line still crosses text when the maximum expansion limit is reached, the position crossing the fewest text segments is taken as the new dividing line.
7. The OCR-based webpage sensitive content marking system according to claim 6, wherein in step S4, the dividing line is translated in steps of 2 pixels.
8. The OCR-based webpage sensitive content marking system according to claim 7, wherein in step S4, the maximum expansion limit is 20 pixels.
9. The OCR-based webpage sensitive content marking system according to claim 1, wherein in step S5, the OCR program is invoked in parallel.
10. The OCR-based webpage sensitive content marking system according to claim 9, wherein text recognition is performed at the character level.
CN202311710386.0A 2023-12-11 2023-12-11 Webpage sensitive content marking system based on OCR Pending CN117763257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311710386.0A CN117763257A (en) 2023-12-11 2023-12-11 Webpage sensitive content marking system based on OCR

Publications (1)

Publication Number Publication Date
CN117763257A true CN117763257A (en) 2024-03-26

Family

ID=90319210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311710386.0A Pending CN117763257A (en) 2023-12-11 2023-12-11 Webpage sensitive content marking system based on OCR

Country Status (1)

Country Link
CN (1) CN117763257A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination