WO2020237799A1 - Method and system for website detection - Google Patents

Method and system for website detection

Info

Publication number
WO2020237799A1
Authority
WO
WIPO (PCT)
Prior art keywords
image analysis
edge device
target
analysis result
model
Prior art date
Application number
PCT/CN2019/096173
Inventor
陈潜森
林汉荣
秦诚
Original Assignee
网宿科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 网宿科技股份有限公司
Priority to EP19917522.5A (EP3771171A4)
Priority to US17/028,807 (US20210004628A1)
Publication of WO2020237799A1

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/30 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/30 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information
    • H04L63/302 Network architectures or network communication protocols for network security for supporting lawful interception, monitoring or retaining of communications or communication related information gathering intelligence information for situation awareness or reconnaissance
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1036 Load balancing of requests to servers for services different from user content provisioning, e.g. load balancing across domain name servers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/60 Scheduling or organising the servicing of application requests, e.g. requests for application data transmissions using the analysis and optimisation of the required network resources
    • H04L67/63 Routing a service request depending on the request content or context
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1001 Protocols in which an application is distributed across nodes in the network for accessing one among a plurality of replicated servers
    • H04L67/1004 Server selection for load balancing

Definitions

  • This application relates to the field of computer technology, in particular to a method and system for website detection.
  • website supervision is mostly carried out by manual detection.
  • the website operator can upload text images of the website to the website supervisor, and a network administrator can then manually inspect the content of the text images to determine whether the corresponding website contains illegal content.
  • a method for website detection is provided; the method is applied to an edge computing system that includes a cloud platform and multiple edge devices deployed in a distributed manner, wherein:
  • the cloud platform receives a website detection request carrying a target URL, and forwards the website detection request to the target edge device corresponding to the target URL;
  • the target edge device obtains a screenshot of the page corresponding to the target URL, analyzes the screenshot of the page based on a preset text recognition algorithm and/or image analysis model, and generates an analysis result;
  • the target edge device feeds back the analysis result to the sender of the website detection request.
  • the analysis of the screenshot of the page based on a preset text recognition algorithm and/or image analysis model to generate an analysis result includes:
  • the target edge device recognizes the text in the screenshot of the page based on OCR technology, and compares the recognized text with the illegal text library based on the Aho-Corasick (AC) automaton algorithm to generate a text analysis result; and/or,
  • the target edge device detects whether the screenshot of the page contains an illegal image based on the image analysis model, and generates an image analysis result.
  • the method further includes:
  • the target edge device trains the image analysis model according to the image analysis result to update the model parameters of the image analysis model.
  • the target edge device training the image analysis model according to the image analysis result includes:
  • if the target edge device receives a result confirmation message fed back by the sender, the target edge device trains the image analysis model according to the image analysis result; otherwise, it discards the image analysis result.
  • the method further includes:
  • the target edge device detects the image analysis result based on a preset image information detection algorithm, and adjusts the image analysis result according to the detection result; or,
  • the target edge device receives a manual adjustment instruction for the image analysis result, and adjusts the image analysis result according to the manual adjustment instruction.
  • the method further includes:
  • the target edge device periodically sends the model parameters of the image analysis model to the cloud platform;
  • the cloud platform periodically updates the model parameters of the image analysis model corresponding to each edge device based on the model parameters of the image analysis model newly uploaded by each edge device;
  • the cloud platform feeds back the corresponding model parameters of the updated image analysis model to each edge device.
  • the edge computing system includes a load balancing device and multiple cloud platforms;
  • before the cloud platform receives the website detection request carrying the target URL, the method further includes:
  • the load balancing device receives the website detection request carrying the target URL, and forwards the website detection request to the target cloud platform according to the running status of the multiple cloud platforms.
  • a system for website detection includes a cloud platform and multiple edge devices deployed in a distributed manner, wherein:
  • the cloud platform is configured to receive a website detection request carrying a target URL, and forward the website detection request to a target edge device corresponding to the target URL;
  • the target edge device is configured to obtain a screenshot of the page corresponding to the target URL, analyze the screenshot of the page based on a preset text recognition algorithm and/or image analysis model, and generate an analysis result;
  • the target edge device is used to feed back the analysis result to the sender of the website detection request.
  • the target edge device is specifically used for:
  • the target edge device is also used for:
  • the target edge device is specifically used for:
  • if a result confirmation message fed back by the sender is received, the image analysis model is trained according to the image analysis result; otherwise, the image analysis result is discarded.
  • the target edge device is also used for:
  • the target edge device is also used to periodically send the model parameters of the image analysis model to the cloud platform;
  • the cloud platform is also used to periodically update the model parameters of the image analysis model corresponding to each edge device based on the model parameters of the image analysis model newly uploaded by each edge device, and to feed back the model parameters of the corresponding updated image analysis model to each edge device.
  • the system includes load balancing equipment and multiple cloud platforms;
  • the load balancing device is configured to receive a website detection request carrying a target URL, and forward the website detection request to the target cloud platform according to the running status of a plurality of the cloud platforms.
  • in a third aspect, a network device is provided; the network device includes a processor and a memory.
  • the memory stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by the processor to realize the processing of the edge device in the method for website detection as described in the first aspect.
  • a computer-readable storage medium stores at least one instruction, at least one program, a code set or an instruction set, and the at least one instruction, the at least one program, the code set or the instruction set is loaded and executed by a processor to realize the processing of the edge device in the method for website detection as described in the first aspect.
  • the cloud platform receives the website detection request carrying the target URL and forwards it to the target edge device corresponding to the target URL; the target edge device obtains a screenshot of the page corresponding to the target URL, analyzes the screenshot based on a preset text recognition algorithm and/or image analysis model, and generates the analysis result; the target edge device feeds back the analysis result to the sender of the website detection request.
  • in this way, when a website needs to be detected, the detection can be executed by distributed edge devices based on machine algorithms. Compared with unified manual detection, this can effectively reduce the detection cost, improve the detection efficiency, and reduce the central load and detection pressure; at the same time, because the edge device is closer to the source site of the website, it can reduce the consumption of bandwidth and traffic and shorten the detection delay.
  • FIG. 1 is a schematic diagram of a network architecture of an edge computing system provided by an embodiment of the present application;
  • FIG. 2 is a flowchart of a method for website detection provided by an embodiment of the present application;
  • FIG. 3 is a schematic diagram of a network architecture of an edge computing system provided by an embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of a network device provided by an embodiment of the present application.
  • the edge computing system may include a cloud platform and multiple edge devices deployed in a distributed manner.
  • the cloud platform can interface with users, uniformly receive website detection requests sent by users, and can forward website detection requests to edge devices after analyzing and encapsulating the website detection requests.
  • the edge device can be any device with a screenshot function and a screenshot recognition function.
  • the edge device can be provided with a screenshot proxy module for implementing the screenshot function and a screenshot analysis module for implementing the screenshot recognition function.
  • Edge devices can be distributed in different regions and/or different operator networks, and each edge device can be responsible for providing services to users in the region and/or operator network to which it belongs.
  • the edge device can include a processor, a memory, and a transceiver.
  • the processor can be used to perform website detection processing in the following process.
  • the memory can be used to store the data needed and generated during the processing, and the transceiver can be used to receive and send relevant data in the process.
  • Step 201: The cloud platform receives a website detection request carrying a target URL (Uniform Resource Locator), and forwards the website detection request to the target edge device corresponding to the target URL.
  • the cloud platform of the edge computing system can receive the website detection request carrying the target URL sent by the above-mentioned user, and then parse and encapsulate the website detection request.
  • the cloud platform can determine the target area and target operator network to which the source site of the target URL belongs, and can then select, according to the target area and target operator network, a target edge device whose distance from the source site of the target URL is less than a preset threshold and which belongs to the same operator network.
  • the cloud platform can forward the website detection request to the target edge device corresponding to the target URL. It is worth mentioning that different edge devices in the edge computing system can also be used for different types of website detection processing.
  • for example, edge device A is used to detect online shopping websites, edge device B is used to detect online reading websites, edge device C is used to detect news websites, and so on. In this way, when the cloud platform selects the target edge device, it can first determine, according to the target website type corresponding to the target URL, all the candidate edge devices used to detect that website type, and then select the target edge device from those candidates according to the above-mentioned target area and target operator network.
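The two-stage selection described above (filter by website type, then by operator network and distance threshold) could be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: the `EdgeDevice` fields, the precomputed `distance_km` value, and the tie-breaking rule of preferring the closest eligible device are all assumptions introduced for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EdgeDevice:
    name: str
    site_type: str      # website type this device is assigned to detect
    operator: str       # operator network the device belongs to
    distance_km: float  # hypothetical precomputed distance to the source site


def select_target_edge_device(
    devices: List[EdgeDevice],
    site_type: str,
    operator: str,
    max_distance_km: float,
) -> Optional[EdgeDevice]:
    """Filter by website type, then by operator network and distance threshold."""
    by_type = [d for d in devices if d.site_type == site_type]
    eligible = [
        d for d in by_type
        if d.operator == operator and d.distance_km < max_distance_km
    ]
    # Assumption: among eligible devices, prefer the closest one.
    return min(eligible, key=lambda d: d.distance_km, default=None)


devices = [
    EdgeDevice("A", "shopping", "op1", 120.0),
    EdgeDevice("B", "reading", "op1", 30.0),
    EdgeDevice("C", "news", "op2", 15.0),
    EdgeDevice("D", "shopping", "op1", 40.0),
]
chosen = select_target_edge_device(devices, "shopping", "op1", 100.0)
print(chosen.name if chosen else None)
```

Returning `None` when no device qualifies leaves the fallback policy (for example, relaxing the threshold) to the caller; the source text does not specify one.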
  • Step 202: The target edge device obtains a screenshot of the page corresponding to the target URL, analyzes the screenshot of the page based on a preset text recognition algorithm and/or an image analysis model, and generates an analysis result.
  • after the target edge device receives the website detection request from the cloud platform, it can extract the target URL carried in the request, and then use the built-in screenshot proxy module to capture the page screenshot corresponding to the target URL from the source site of the target URL. At the same time, the target edge device can analyze the page screenshot based on the preset text recognition algorithm and image analysis model to determine whether the screenshot contains non-compliant or illegal text or images, and generate the analysis result.
  • the analysis of page screenshots may mainly include text analysis and image analysis.
  • the processing of step 202 may be as follows: the target edge device recognizes the text in the page screenshot based on OCR (Optical Character Recognition) technology, and compares the recognized text with the illegal text library based on the AC automaton algorithm to generate a text analysis result; and/or the target edge device detects whether the page screenshot contains an illegal image based on the image analysis model, and generates an image analysis result.
  • the target edge device can analyze the text and image content in the page screenshot to determine whether the screenshot contains non-compliant or illegal text or images.
  • the target edge device can use OCR technology to recognize the text in the screenshot of the page, and then use the AC automaton algorithm to compare the recognized text with the illegal text library to generate the text analysis result. Illegal text can be recorded in the illegal text library; when an entry in the illegal text library matches the recognized text, it can be determined that the screenshot of the page contains non-compliant or illegal text.
  • the target edge device can also continuously update the content of the illegal text library based on the website detection results.
  • the cloud platform can regularly aggregate the content of the illegal text libraries of all edge devices of this type, and then use the aggregated content to update the illegal text library of each edge device of this type.
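To make the text-matching step concrete, the following is a small self-contained sketch of an Aho-Corasick automaton, which is the usual meaning of "AC automaton": it finds every library entry that occurs in the recognized text in a single pass. The class name, the sample library entries, and the shape of the result dictionary are illustrative assumptions; the OCR step is represented only by a hard-coded string.

```python
from collections import deque
from typing import Dict, Iterable, List, Set


class AhoCorasick:
    """Multi-pattern matcher: a trie plus failure links built by BFS."""

    def __init__(self, patterns: Iterable[str]) -> None:
        self.goto: List[Dict[str, int]] = [{}]
        self.fail: List[int] = [0]
        self.out: List[Set[str]] = [set()]
        for pattern in patterns:
            if pattern:  # skip empty entries
                self._add(pattern)
        self._build()

    def _add(self, pattern: str) -> None:
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].add(pattern)

    def _build(self) -> None:
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, nxt in self.goto[node].items():
                queue.append(nxt)
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[nxt] |= self.out[self.fail[nxt]]

    def find(self, text: str) -> Set[str]:
        node, found = 0, set()
        for ch in text:
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            found |= self.out[node]
        return found


# Hypothetical library entries and OCR output, for illustration only.
ac = AhoCorasick(["forbidden", "banned phrase"])
recognized_text = "this page sells a forbidden item"
hits = ac.find(recognized_text)
text_analysis_result = {"violation": bool(hits), "matches": sorted(hits)}
print(text_analysis_result)
```

Because the automaton is built once per library, an edge device would only need to rebuild it when the illegal text library is updated, not on every request.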
  • the target edge device can call a preset image analysis model and use it to perform machine vision analysis on page screenshots, detecting whether the page screenshots involve pornographic, politically sensitive, violent, or other illegal image content, so as to generate the image analysis result.
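One way to picture how the text analysis and image analysis of step 202 combine into a single analysis result is the sketch below. The OCR engine, the text matcher, and the image model are passed in as plain callables because the source does not name concrete components; the function name, the result fields, and the stand-in lambdas are assumptions made for the example.

```python
from typing import Callable, Dict, List


def analyze_page_screenshot(
    screenshot: bytes,
    ocr: Callable[[bytes], str],
    match_illegal_text: Callable[[str], List[str]],
    detect_illegal_image: Callable[[bytes], bool],
) -> Dict[str, object]:
    """Run text analysis and image analysis and merge them into one result."""
    text = ocr(screenshot)
    matches = match_illegal_text(text)
    image_flagged = detect_illegal_image(screenshot)
    return {
        "text_result": {"violation": bool(matches), "matches": matches},
        "image_result": {"violation": image_flagged},
        # Assumption: the page is flagged if either analysis flags it.
        "violation": bool(matches) or image_flagged,
    }


# Stand-in components for demonstration only.
result = analyze_page_screenshot(
    b"fake-png-bytes",
    ocr=lambda img: "limited offer on forbidden goods",
    match_illegal_text=lambda text: [w for w in ["forbidden"] if w in text],
    detect_illegal_image=lambda img: False,
)
print(result["violation"])
```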
  • Step 203: The target edge device feeds back the analysis result to the sender of the website detection request.
  • after the target edge device analyzes the screenshot of the page corresponding to the target URL and generates the analysis result, it can feed back the analysis result to the sender of the website detection request.
  • the user can specify the receiving end of the analysis result in the website detection request, so that the target edge device can send the analysis result to the receiving end after generating the analysis result.
  • the cloud platform can select multiple target edge devices to jointly detect the target URL. In this way, after the target edge device generates the analysis result, it can also feed back the analysis result to the cloud platform first.
  • the cloud platform can summarize the analysis results fed back by all target edge devices, and then feed back the summarized analysis results to the sender of the website detection request.
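When several target edge devices detect the same URL, the cloud platform's summarizing step might look like the following sketch. The source does not say how the individual results are merged, so the rule used here (the page is flagged if any device flags it) and the field names are assumptions.

```python
from typing import Dict, List


def summarize_results(results: List[Dict]) -> Dict:
    """Merge per-device analysis results into one summary for the sender."""
    flagged_by = [r["device"] for r in results if r["violation"]]
    return {
        "violation": bool(flagged_by),  # assumption: any single flag is enough
        "flagged_by": flagged_by,
        "device_count": len(results),
    }


summary = summarize_results([
    {"device": "A", "violation": False},
    {"device": "D", "violation": True},
])
print(summary)
```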
  • the edge device can also use the image analysis results to perform model enhancement training on the image analysis model to optimize and update the image analysis model.
  • the corresponding processing can be as follows: the target edge device trains the image analysis model according to the image analysis results to update the model parameters of the image analysis model.
  • each edge device may be provided with a model training module, through which the edge device can continuously optimize the image analysis model on it.
  • take the target edge device as an example.
  • the image analysis result can be input into the above model training module, so that the image analysis model can be further trained according to the image analysis result to update the model parameters of the image analysis model.
  • the function of the model training module can be implemented by another independent model training device, and the model training device can implement the above-mentioned image analysis model training processing by interacting with an edge device.
  • if the target edge device receives a result confirmation message fed back by the sender, it trains the image analysis model based on the image analysis result; otherwise, it discards the image analysis result.
  • after the target edge device feeds back the analysis result to the sender of the website detection request, it can detect whether the sender has fed back a result confirmation message. If the result confirmation message sent by the sender is received, the target edge device can determine that this image analysis is correct and can then train the image analysis model based on the image analysis result; if no result confirmation message is received, or a result error message is received, the target edge device can discard the image analysis result. In addition, the target edge device can update the total number of image analysis errors after receiving a result error message, and when the total number reaches a preset threshold, it can actively suspend the website detection service.
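The confirmation-gated training logic can be sketched as a small state holder. The class name, the string values used for feedback, and the use of a queue to stand in for the actual training step are assumptions for illustration; only the decision rules (train on confirmation, discard otherwise, count errors, suspend at a threshold) come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FeedbackGate:
    error_threshold: int            # preset threshold for total analysis errors
    error_count: int = 0
    suspended: bool = False         # True once the detection service is paused
    training_queue: List[dict] = field(default_factory=list)

    def handle_feedback(self, image_result: dict, feedback: Optional[str]) -> None:
        """feedback is "confirm", "error", or None when nothing was received."""
        if feedback == "confirm":
            # Confirmed results are kept as training samples.
            self.training_queue.append(image_result)
            return
        # No message or an error message: the result is discarded.
        if feedback == "error":
            self.error_count += 1
            if self.error_count >= self.error_threshold:
                self.suspended = True


gate = FeedbackGate(error_threshold=2)
gate.handle_feedback({"id": 1}, "confirm")
gate.handle_feedback({"id": 2}, None)
gate.handle_feedback({"id": 3}, "error")
gate.handle_feedback({"id": 4}, "error")
print(len(gate.training_queue), gate.error_count, gate.suspended)
```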
  • the image analysis results can be adjusted to ensure the effectiveness of the model training.
  • the corresponding processing can be as follows: the target edge device checks the image analysis result based on a preset image information detection algorithm, and adjusts the image analysis result according to the detection result; or the target edge device receives a manual adjustment instruction for the image analysis result, and adjusts the image analysis result according to the manual adjustment instruction.
  • the target edge device may first adjust the image analysis results before using the generated image analysis results to train the image analysis model to ensure the correctness of the image analysis results.
  • an image information detection algorithm can be preset on the target edge device to detect non-compliant or illegal images and confirm whether an image contains non-compliant or illegal content.
  • the target edge device can detect the image analysis result based on the preset image information detection algorithm, and then adjust the image analysis result according to the detection result.
  • the technical staff of the edge computing system can manually inspect the image analysis results.
  • the technical staff can manually check only those image analysis results that indicate non-compliant or illegal content, and then control the edge devices, through manual adjustment instructions, to adjust the image analysis results. In this way, after receiving the manual adjustment instruction for the image analysis result, the target edge device can adjust the image analysis result according to the manual adjustment instruction.
  • the cloud platform can also periodically aggregate and update the model parameters of the image analysis model of all edge nodes.
  • the corresponding processing can be as follows: the target edge device periodically sends the model parameters of the image analysis model to the cloud platform; the cloud platform periodically updates the model parameters of the image analysis model corresponding to each edge device based on the model parameters of the latest image analysis model uploaded by each edge device; the cloud platform feeds back the model parameters of the corresponding updated image analysis model to each edge device.
  • all edge devices including the target edge device in the edge computing system can periodically send the model parameters of the image analysis model to the cloud platform.
  • the cloud platform can periodically update the model parameters of the image analysis model corresponding to each edge device based on the model parameters of the latest image analysis model uploaded by each edge device, and then feed back the model parameters of the corresponding updated image analysis model to each edge device, so as to ensure the accuracy of the model parameters of the image analysis model on each edge device.
  • when the cloud platform updates the model parameters of the image analysis model, it can uniformly update the image analysis models of edge devices responsible for the same website type, so that each image analysis model can detect website pages of the corresponding type in a more targeted and accurate manner.
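The periodic, per-type parameter update could be sketched as below. The source does not state how the cloud platform combines the uploaded parameters; element-wise averaging within each website type is used here purely as an assumed stand-in, and the flat list-of-floats parameter format and the field names are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate_by_type(uploads: Dict[str, dict]) -> Dict[str, List[float]]:
    """uploads maps device name -> {"type": str, "params": list of floats}.

    Returns the updated parameters to feed back to each device.
    """
    groups = defaultdict(list)
    for info in uploads.values():
        groups[info["type"]].append(info["params"])
    # Assumption: parameters of same-type devices are averaged element-wise.
    merged = {
        site_type: [sum(col) / len(col) for col in zip(*param_lists)]
        for site_type, param_lists in groups.items()
    }
    return {device: merged[info["type"]] for device, info in uploads.items()}


uploads = {
    "A": {"type": "shopping", "params": [1.0, 2.0]},
    "D": {"type": "shopping", "params": [3.0, 4.0]},
    "C": {"type": "news", "params": [0.5, 0.5]},
}
updated = aggregate_by_type(uploads)
print(updated)
```

Grouping by type before merging keeps, for example, a news-site model from being diluted by parameters learned on shopping sites, which matches the stated goal of more targeted detection.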
  • the edge computing system may include a load balancing device and multiple cloud platforms, where the load balancing device receives a website detection request carrying a target URL, and forwards the website detection request to the target cloud platform according to the running status of the multiple cloud platforms.
  • multiple cloud platforms may be set up in the edge computing system, together with a load balancing device used for load balancing among the multiple cloud platforms.
  • the load balancing device can obtain the operating status of multiple cloud platforms in real time, and then can distribute the received website detection request among multiple cloud platforms according to the operating status.
  • the user can send the website detection request to the edge computing system, and the website detection request can be directed to the aforementioned load balancing device.
  • the load balancing device can forward the website detection request to the target cloud platform according to the running status of multiple cloud platforms.
  • the process of selecting the target cloud platform here may be selecting the cloud platform with the lowest load, or selecting the cloud platform with the best performance, or according to other selection principles, which is not limited in this embodiment.
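As one example of such a selection principle, the sketch below picks the cloud platform with the lowest load among those reported as healthy. The status fields (`healthy`, `load`) and the platform names are hypothetical; the embodiment explicitly leaves the selection rule open.

```python
from typing import Dict


def pick_cloud_platform(status: Dict[str, dict]) -> str:
    """Return the name of the healthy cloud platform with the lowest load."""
    healthy = {name: s for name, s in status.items() if s["healthy"]}
    if not healthy:
        raise RuntimeError("no cloud platform is available")
    return min(healthy, key=lambda name: healthy[name]["load"])


running_status = {
    "cloud-1": {"healthy": True, "load": 0.72},
    "cloud-2": {"healthy": True, "load": 0.31},
    "cloud-3": {"healthy": False, "load": 0.05},
}
target_cloud = pick_cloud_platform(running_status)
print(target_cloud)
```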
  • the cloud platform receives the website detection request carrying the target URL and forwards it to the target edge device corresponding to the target URL; the target edge device obtains a screenshot of the page corresponding to the target URL, analyzes the screenshot based on a preset text recognition algorithm and/or image analysis model, and generates the analysis result; the target edge device feeds back the analysis result to the sender of the website detection request.
  • in this way, when a website needs to be detected, the detection can be executed by distributed edge devices based on machine algorithms. Compared with unified manual detection, this can effectively reduce the detection cost, improve the detection efficiency, and reduce the central load and detection pressure; at the same time, because the edge device is close to the source site of the website, it can reduce bandwidth consumption and shorten the detection delay.
  • an embodiment of the present application also provides a system for website detection.
  • the system includes a cloud platform and multiple edge devices deployed in a distributed manner, wherein:
  • the cloud platform is configured to receive a website detection request carrying a target URL, and forward the website detection request to a target edge device corresponding to the target URL;
  • the target edge device is configured to obtain a screenshot of the page corresponding to the target URL, analyze the screenshot of the page based on a preset text recognition algorithm and/or image analysis model, and generate an analysis result;
  • the target edge device is used to feed back the analysis result to the sender of the website detection request.
  • the target edge device is specifically used for:
  • the target edge device is also used for:
  • the target edge device is specifically used for:
  • if a result confirmation message fed back by the sender is received, the image analysis model is trained according to the image analysis result; otherwise, the image analysis result is discarded.
  • the target edge device is also used for:
  • the target edge device is also used to periodically send the model parameters of the image analysis model to the cloud platform;
  • the cloud platform is also used to periodically update the model parameters of the image analysis model corresponding to each edge device based on the model parameters of the image analysis model newly uploaded by each edge device, and to feed back the model parameters of the corresponding updated image analysis model to each edge device.
  • the system includes load balancing equipment and multiple cloud platforms;
  • the load balancing device is configured to receive a website detection request carrying a target URL, and forward the website detection request to the target cloud platform according to the running status of the multiple cloud platforms.
  • the cloud platform receives the website detection request carrying the target URL, and forwards the website detection request to the target edge device corresponding to the target URL; the target edge device obtains a screenshot of the page corresponding to the target URL, based on a preset text recognition algorithm And/or the image analysis model analyzes the screenshot of the page and generates the analysis result; the target edge device feeds back the analysis result to the sender of the website detection request.
  • a website needs to be detected, it can be executed by distributed edge devices based on machine algorithms. Compared with the unified manual detection method, it can effectively reduce the detection cost, improve the detection efficiency, and reduce the center load and detection pressure. ; At the same time, because the edge device is close to the source site of the website, it can reduce the bandwidth consumption and shorten the detection delay.
  • Fig. 4 is a schematic structural diagram of a network device provided by an embodiment of the present application.
  • The network device 400 may vary considerably depending on configuration or performance, and may include one or more central processing units 422 (for example, one or more processors), a memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444.
  • The memory 432 and the storage medium 430 may provide transient or persistent storage.
  • The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the network device 400.
  • The central processing unit 422 may be configured to communicate with the storage medium 430 and execute, on the network device 400, the series of instruction operations in the storage medium 430.
  • The network device 400 may also include one or more power supplies 429, one or more wired or wireless network interfaces 450, one or more input/output interfaces 458, one or more keyboards 456, and/or one or more operating systems 441, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc.
  • The network device 400 may include a memory and one or more programs. The one or more programs are stored in the memory and configured to be executed by one or more processors, and they contain instructions for performing the edge device's processing in the website detection described above.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Technology Law (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)
  • Image Analysis (AREA)

Abstract

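Some embodiments of the present application disclose a method and system for website detection, belonging to the field of computer technology. The method includes: the cloud platform receives a website detection request carrying a target URL and forwards the website detection request to a target edge device corresponding to the target URL (201); the target edge device obtains a page screenshot corresponding to the target URL, analyzes the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generates an analysis result (202); the target edge device feeds the analysis result back to the sender of the website detection request (203). With the embodiments of the present application, the cost of website detection can be effectively reduced, the efficiency of website detection improved, the consumption of bandwidth traffic reduced, and the delay of website detection shortened.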

Description

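Method and System for Website Detection
Cross-Reference
This application refers to Chinese patent application No. 201910457676.6, titled "Method and System for Website Detection" and filed on May 29, 2019, which is incorporated into this application by reference in its entirety.
Technical Field
The present application relates to the field of computer technology, and in particular to a method and system for website detection.
Background
In recent years, with the rapid development of the Internet, the number of websites has grown and their content has become increasingly rich and diverse. At the same time, many websites containing illegal or non-compliant content have appeared, and websites subjected to malicious attacks have had their pages hijacked or tampered with so that illegal or non-compliant content appears on them. Website supervision has therefore become a pressing need in the Internet field.
At present, website supervision mostly relies on manual detection. When a website needs to be checked for illegal or non-compliant content, the website owner uploads the website's text and images to the website regulator, and a network administrator then manually inspects that text and image content to determine whether the website contains illegal or non-compliant content.
In the process of implementing the present application, the inventors found that the prior art has at least the following problems:
Because the number of websites and the volume of their content keep increasing, a large amount of text and images must be inspected manually. First, reviewing large volumes of text and images consumes considerable labor and time. Second, uploading large volumes of text and images to the website regulator leads to high bandwidth consumption and high detection delay. As a result, website detection is currently difficult, inefficient, slow, and costly.
Summary
To solve the problems of the prior art, some embodiments of the present application provide a method and system for website detection. The technical solutions are as follows.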
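In a first aspect, a method for website detection is provided. The method is applied to an edge computing system that includes a cloud platform and multiple edge devices deployed in a distributed manner, wherein:
the cloud platform receives a website detection request carrying a target URL, and forwards the website detection request to a target edge device corresponding to the target URL;
the target edge device obtains a page screenshot corresponding to the target URL, analyzes the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generates an analysis result;
the target edge device feeds the analysis result back to the sender of the website detection request.
For example, analyzing the page screenshot based on the preset text recognition algorithm and/or image analysis model and generating the analysis result includes:
the target edge device recognizes the text in the page screenshot based on OCR technology, compares the recognized text with a violation text library based on the AC automaton algorithm, and generates a text analysis result; and/or
the target edge device detects, based on the image analysis model, whether the page screenshot contains a violating image, and generates an image analysis result.
For example, the method further includes:
the target edge device trains the image analysis model according to the image analysis result, so as to update the model parameters of the image analysis model.
For example, the training of the image analysis model by the target edge device according to the image analysis result includes:
if a result confirmation message sent by the sender is received, the target edge device trains the image analysis model according to the image analysis result; otherwise, it discards the image analysis result.
For example, before the target edge device trains the image analysis model according to the image analysis result, the method further includes:
the target edge device checks the image analysis result based on a preset image information detection algorithm, and adjusts the image analysis result according to the check result; or
the target edge device receives a manual adjustment instruction for the image analysis result, and adjusts the image analysis result according to the manual adjustment instruction.
For example, the method further includes:
the target edge device periodically sends the model parameters of the image analysis model to the cloud platform;
the cloud platform periodically updates the model parameters of the image analysis model corresponding to each edge device, based on the model parameters most recently uploaded by each edge device;
the cloud platform feeds the corresponding updated model parameters of the image analysis model back to each edge device.
For example, the edge computing system includes a load-balancing device and multiple cloud platforms;
before the cloud platform receives the website detection request carrying the target URL, the method further includes:
the load-balancing device receives the website detection request carrying the target URL, and forwards the website detection request to a target cloud platform according to the running status of the multiple cloud platforms.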
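In a second aspect, a system for website detection is provided. The system includes a cloud platform and multiple edge devices deployed in a distributed manner, wherein:
the cloud platform is configured to receive a website detection request carrying a target URL, and forward the website detection request to a target edge device corresponding to the target URL;
the target edge device is configured to obtain a page screenshot corresponding to the target URL, analyze the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generate an analysis result;
the target edge device is configured to feed the analysis result back to the sender of the website detection request.
For example, the target edge device is specifically configured to:
recognize the text in the page screenshot based on OCR technology, compare the recognized text with a violation text library based on the AC automaton algorithm, and generate a text analysis result; and/or
detect, based on the image analysis model, whether the page screenshot contains a violating image, and generate an image analysis result.
For example, the target edge device is further configured to:
train the image analysis model according to the image analysis result, so as to update the model parameters of the image analysis model.
For example, the target edge device is specifically configured to:
if a result confirmation message sent by the sender is received, train the image analysis model according to the image analysis result; otherwise, discard the image analysis result.
For example, the target edge device is further configured to:
before training the image analysis model according to the image analysis result, check the image analysis result based on a preset image information detection algorithm, and adjust the image analysis result according to the check result; or
before training the image analysis model according to the image analysis result, receive a manual adjustment instruction for the image analysis result, and adjust the image analysis result according to the manual adjustment instruction.
For example, the target edge device is further configured to periodically send the model parameters of the image analysis model to the cloud platform;
the cloud platform is further configured to periodically update the model parameters of the image analysis model corresponding to each edge device, based on the model parameters most recently uploaded by each edge device, and to feed the corresponding updated model parameters of the image analysis model back to each edge device.
For example, the system includes a load-balancing device and multiple cloud platforms;
the load-balancing device is configured to receive a website detection request carrying a target URL, and forward the website detection request to a target cloud platform according to the running status of the multiple cloud platforms.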
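In a third aspect, a network device is provided. The network device includes a processor and a memory; the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the processing of the edge device in the method for website detection according to the first aspect.
In a fourth aspect, a computer-readable storage medium is provided. The storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the processing of the edge device in the method for website detection according to the first aspect.
The beneficial effects of the technical solutions provided by the embodiments of the present application are as follows:
In the embodiments of the present application, the cloud platform receives a website detection request carrying a target URL, and forwards the website detection request to a target edge device corresponding to the target URL; the target edge device obtains a page screenshot corresponding to the target URL, analyzes the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generates an analysis result; the target edge device feeds the analysis result back to the sender of the website detection request. In this way, when a website needs to be detected, the detection can be performed by distributed edge devices using machine algorithms. Compared with centralized manual detection, this effectively reduces detection cost, improves detection efficiency, and reduces the central load and detection pressure; in addition, because the edge device is close to the origin site of the website, bandwidth consumption is reduced and detection delay is shortened.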
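Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of the network architecture of an edge computing system provided by an embodiment of the present application;
Fig. 2 is a flowchart of a method for website detection provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the network architecture of an edge computing system provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a network device provided by an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, some embodiments of the present application are described in detail below with reference to the drawings.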
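An embodiment of the present application provides a method for website detection. The method can be applied to an edge computing system which, as shown in Fig. 1, may include a cloud platform and multiple edge devices deployed in a distributed manner. The cloud platform can interface with users, centrally receive the website detection requests sent by users, and, after processing a website detection request (parsing, encapsulation, and so on), forward it to an edge device. An edge device may be any device with a screenshot function and a screenshot recognition function; specifically, it may be provided with a screenshot proxy module for implementing the screenshot function and a screenshot analysis module for implementing the screenshot recognition function. Edge devices may be deployed in a distributed manner in different regions and/or different operator networks, and each edge device may be responsible for serving users in the region and/or operator network to which it belongs. An edge device may include a processor, a memory, and a transceiver: the processor may be used to perform the website detection processing in the flow described below, the memory may be used to store the data required and generated during processing, and the transceiver may be used to receive and send the relevant data during processing.
The processing flow shown in Fig. 2 is described in detail below with reference to specific embodiments. The content may be as follows.
Step 201: the cloud platform receives a website detection request carrying a target URL (Uniform Resource Locator), and forwards the website detection request to a target edge device corresponding to the target URL.
In implementation, when a user needs to check whether a website contains illegal or non-compliant content, the user can send a website detection request to the edge computing system and add to that request the URL of the website page to be checked (that is, the target URL, which may be one website page URL or multiple URLs of multiple website pages). The cloud platform of the edge computing system thus receives the website detection request carrying the target URL sent by the user, and then processes the request (parsing, encapsulation, and so on). For each target URL, after obtaining the target URL, the cloud platform can determine the target region and target operator network to which the origin site of the target URL belongs, and can then, according to the target region and target operator network, select a target edge device whose distance from the origin site of the target URL is less than a preset threshold and which belongs to the same operator network. The cloud platform can then forward the website detection request to the target edge device corresponding to the target URL. It is worth mentioning that different edge devices in the edge computing system may also be responsible for detecting different types of websites; for example, edge device A detects online shopping websites, edge device B detects online reading websites, and edge device C detects news websites. In that case, when selecting the target edge device, the cloud platform can first determine, according to the target website type corresponding to the target URL, all candidate edge devices used to detect that website type, and then select the target edge device from those candidates according to the target region and target operator network described above.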
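The selection rule in step 201 (filter by website type, then by operator network and distance threshold) can be illustrated with a minimal sketch. The names `EdgeDevice`, `select_target_edge`, and `distance_km` are hypothetical and do not come from the application; picking the closest of the qualifying devices is also an assumption, since the text only requires the distance to be below a preset threshold.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EdgeDevice:
    name: str
    carrier: str           # operator network the device is deployed in
    distance_km: float     # hypothetical precomputed distance to the origin site
    site_types: frozenset  # website categories this device is responsible for


def select_target_edge(devices, origin_carrier, site_type, max_distance_km):
    """Return the closest eligible edge device, or None if none qualifies."""
    candidates = [
        d for d in devices
        if site_type in d.site_types            # handles this website type
        and d.carrier == origin_carrier         # same operator network as the origin
        and d.distance_km < max_distance_km     # within the preset threshold
    ]
    # Assumption: among qualifying devices, prefer the nearest one.
    return min(candidates, key=lambda d: d.distance_km, default=None)


devices = [
    EdgeDevice("A", "telecom", 120.0, frozenset({"shopping"})),
    EdgeDevice("B", "telecom", 30.0, frozenset({"news"})),
    EdgeDevice("C", "unicom", 10.0, frozenset({"news"})),
    EdgeDevice("D", "telecom", 80.0, frozenset({"news"})),
]
target = select_target_edge(devices, "telecom", "news", max_distance_km=100.0)
```

In this sketch device C is nearer but sits in a different operator network, so it is filtered out before the distance comparison.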
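Step 202: the target edge device obtains a page screenshot corresponding to the target URL, analyzes the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generates an analysis result.
In implementation, after receiving the website detection request sent by the cloud platform, the target edge device can extract the target URL carried in it, and then capture the page screenshot corresponding to the target URL from the origin site of the target URL through the built-in screenshot proxy module. The target edge device can then analyze the page screenshot based on the preset text recognition algorithm and image analysis model to determine whether the page screenshot contains illegal or non-compliant text or images, thereby generating an analysis result.
For example, the analysis of the page screenshot may mainly include text analysis and image analysis. Correspondingly, the processing of step 202 may be as follows: the target edge device recognizes the text in the page screenshot based on OCR (Optical Character Recognition) technology, compares the recognized text with a violation text library based on the AC automaton algorithm, and generates a text analysis result; and/or the target edge device detects, based on the image analysis model, whether the page screenshot contains a violating image, and generates an image analysis result.
In implementation, after obtaining the page screenshot corresponding to the target URL, the target edge device can analyze the text and the image content in the page screenshot separately to determine whether the page screenshot contains illegal or non-compliant text or images. On the one hand, the target edge device can use OCR technology to recognize the text in the page screenshot, and then compare the recognized text with the violation text library using the AC automaton algorithm, thereby generating a text analysis result. It is easy to understand that the violation text library may record illegal and non-compliant text; when text in the violation text library matches the recognized text, it can be determined that the page screenshot contains illegal or non-compliant text. For example, the target edge device can also continuously update the content of the violation text library according to website detection results; and, for the edge devices used to detect each type of website, the cloud platform can periodically collect the contents of the violation text libraries of all edge devices of that type and then use the collected content to update the violation text library of every edge device of that type. On the other hand, the target edge device can call the preset image analysis model and use it to perform machine vision analysis on the page screenshot, to detect whether the page screenshot contains illegal or non-compliant image content such as pornography, politically sensitive material, or violence and terrorism, thereby generating an image analysis result.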
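The text-matching half of step 202 relies on the AC automaton (Aho-Corasick) algorithm, which finds every occurrence of every library entry in a single pass over the recognized text. The following is a minimal, self-contained sketch of that algorithm; the OCR step is not shown, and the input string is simply assumed to be the text an OCR engine returned. The class name `AhoCorasick`, the helper `analyze_text`, and the result layout are illustrative assumptions, not the application's implementation.

```python
from collections import deque


class AhoCorasick:
    """Multi-pattern matcher: a trie plus failure links."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # patterns that end at each node
        for pattern in patterns:
            if pattern:    # skip empty library entries
                self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(pattern)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit matches that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every match in text."""
        hits = []
        node = 0
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits


def analyze_text(ocr_text, matcher):
    """Build a hypothetical text analysis result from OCR output."""
    hits = matcher.search(ocr_text)
    return {"violation": bool(hits), "terms": sorted({p for _, p in hits})}


matcher = AhoCorasick(["he", "she", "his", "hers"])
matches = matcher.search("ushers")
result = analyze_text("ushers", matcher)
clean = analyze_text("nothing to flag", matcher)
```

Because the automaton is built once from the violation text library and then reused, the scan cost grows with the length of the recognized text and the number of matches rather than with the size of the library, which is why this family of algorithms suits large keyword lists.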
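Step 203: the target edge device feeds the analysis result back to the sender of the website detection request.
In implementation, after analyzing the page screenshot corresponding to the target URL and generating the analysis result, the target edge device can feed the analysis result back to the sender of the website detection request. Of course, the user can specify a receiver for the analysis result in the website detection request, so that the target edge device sends the analysis result to that receiver after generating it. For example, to ensure the accuracy of website detection, the cloud platform may in step 201 select multiple target edge devices to detect the target URL jointly. In that case, after generating its analysis result, each target edge device can first feed the analysis result back to the cloud platform; the cloud platform can consolidate the analysis results fed back by all target edge devices, and then feed the consolidated analysis result back to the sender of the website detection request.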
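For example, the edge device can also use the image analysis result to further train and strengthen the image analysis model, so as to optimize and update it. The corresponding processing may be as follows: the target edge device trains the image analysis model according to the image analysis result, so as to update the model parameters of the image analysis model.
In implementation, each edge device may be provided with a model training module, through which the edge device can continuously optimize the image analysis model it hosts. Taking the target edge device as an example, after generating an image analysis result through the image analysis model, the target edge device can input the image analysis result into the model training module, so that the image analysis model is further trained according to the image analysis result and its model parameters are updated. Of course, in another embodiment, the function of the model training module may be implemented by a separate model training device, which can carry out the training of the image analysis model by interacting with the edge device.
For example, to ensure that model training is effective, only correct image analysis results may be selected for training the image analysis model. The corresponding processing may be as follows: if a result confirmation message sent by the sender is received, the target edge device trains the image analysis model according to the image analysis result; otherwise, it discards the image analysis result.
In implementation, after feeding the analysis result back to the sender of the website detection request, the target edge device can check whether the sender returns a result confirmation message. If a result confirmation message sent by the sender is received, the target edge device can determine that this image analysis was correct and can then train the image analysis model according to the image analysis result. If no result confirmation message is received, or a result error message is received, the target edge device can discard this image analysis result. In addition, after receiving a result error message, the target edge device can update the total number of image analysis errors, and when that total reaches a preset threshold, it can proactively suspend the website detection service.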
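The confirmation-gated training rule can be sketched as a small state holder: confirmed results are queued for training, everything else is discarded, and error messages are counted until the service pauses itself. The names `Feedback`, `TrainingGate`, and `error_threshold` are hypothetical, and the queue stands in for the model training module, whose interface the application does not specify.

```python
from dataclasses import dataclass, field
from enum import Enum


class Feedback(Enum):
    CONFIRMED = "confirmed"  # sender returned a result confirmation message
    ERROR = "error"          # sender returned a result error message
    NONE = "none"            # no message was received


@dataclass
class TrainingGate:
    error_threshold: int                 # preset threshold for pausing the service
    error_count: int = 0
    paused: bool = False
    training_queue: list = field(default_factory=list)

    def handle(self, image_result, feedback):
        """Queue the result for training only if the sender confirmed it."""
        if feedback is Feedback.CONFIRMED:
            self.training_queue.append(image_result)
            return True
        if feedback is Feedback.ERROR:
            self.error_count += 1
            if self.error_count >= self.error_threshold:
                self.paused = True       # proactively suspend website detection
        return False                     # result is discarded


gate = TrainingGate(error_threshold=2)
kept = gate.handle("result-1", Feedback.CONFIRMED)
dropped = gate.handle("result-2", Feedback.NONE)
gate.handle("result-3", Feedback.ERROR)
paused_after_one_error = gate.paused
gate.handle("result-4", Feedback.ERROR)
```

A missing reply discards the result without counting as an error, which matches the text: only an explicit result error message updates the error total.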
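For example, before the image analysis result is used to further train the image analysis model, the image analysis result may be adjusted to ensure the effectiveness of model training. The corresponding processing may be as follows: the target edge device checks the image analysis result based on a preset image information detection algorithm, and adjusts the image analysis result according to the check result; or the target edge device receives a manual adjustment instruction for the image analysis result, and adjusts the image analysis result according to the manual adjustment instruction.
In implementation, before using the generated image analysis result to train the image analysis model, the target edge device can first adjust the image analysis result to ensure that it is correct. In one approach, an image information detection algorithm can be preset on the target edge device to check images flagged as illegal or non-compliant and confirm whether they really contain such content. The target edge device can then check the image analysis result based on the preset image information detection algorithm and adjust the image analysis result according to the check result. In another approach, technicians of the edge computing system can verify image analysis results manually. To reduce the manual workload, and considering that illegal or non-compliant images make up a small share of the total, the technicians may verify only those image analysis results that report illegal or non-compliant content, and then direct the edge device to adjust the image analysis result by means of a manual adjustment instruction. After receiving the manual adjustment instruction for the image analysis result, the target edge device can adjust the image analysis result accordingly.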
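For example, the cloud platform can also periodically aggregate and update the model parameters of the image analysis models of all edge nodes. The corresponding processing may be as follows: the target edge device periodically sends the model parameters of the image analysis model to the cloud platform; the cloud platform periodically updates the model parameters of the image analysis model corresponding to each edge device, based on the model parameters most recently uploaded by each edge device; the cloud platform feeds the corresponding updated model parameters of the image analysis model back to each edge device.
In implementation, all edge devices in the edge computing system, including the target edge device, can periodically send the model parameters of their image analysis models to the cloud platform. The cloud platform can then periodically update the model parameters of the image analysis model corresponding to each edge device, based on the model parameters most recently uploaded by each edge device, and feed the corresponding updated model parameters back to each edge device, thereby keeping the model parameters of the image analysis model on every edge device accurate. It is worth mentioning that if different edge devices in the edge computing system are responsible for detecting different types of websites, the cloud platform can, when updating model parameters, update the image analysis models of the same type together, grouped by the type each device is responsible for, so that each image analysis model can detect website pages of its type in a more targeted and accurate way.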
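The application does not state how the cloud platform combines the uploaded parameters. As one possible reading, the sketch below averages the parameter vectors element-wise within each website type and returns the averaged vector to every device of that type. The averaging rule, the flat list-of-floats parameter format, and the name `aggregate_parameters` are all assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_parameters(uploads):
    """Average the latest parameters per website type.

    uploads maps device_id -> (site_type, parameter list); the result maps
    device_id -> updated parameter list for that device.
    """
    by_type = defaultdict(list)
    for site_type, params in uploads.values():
        by_type[site_type].append(params)

    averaged = {}
    for site_type, param_lists in by_type.items():
        count = len(param_lists)
        # zip(*...) walks the vectors position by position.
        averaged[site_type] = [sum(column) / count for column in zip(*param_lists)]

    return {
        device_id: list(averaged[site_type])
        for device_id, (site_type, _) in uploads.items()
    }


uploads = {
    "edge-a": ("news", [1.0, 2.0]),
    "edge-b": ("news", [3.0, 4.0]),
    "edge-c": ("shopping", [10.0, 20.0]),
}
updated = aggregate_parameters(uploads)
```

Grouping by type before averaging keeps a model trained on news pages from being diluted by parameters learned on shopping pages, which is the motivation the text gives for per-type updates.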
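For example, as shown in Fig. 3, the edge computing system may include a load-balancing device and multiple cloud platforms, where the load-balancing device receives the website detection request carrying the target URL and forwards the website detection request to a target cloud platform according to the running status of the multiple cloud platforms.
In implementation, the edge computing system may be provided with multiple cloud platforms, as well as a load-balancing device for balancing load among them. The load-balancing device can obtain the running status of the multiple cloud platforms in real time, and can then distribute the website detection requests it receives among the cloud platforms according to that running status. Taking the website detection request carrying the target URL in step 201 as an example, the user can send the website detection request to the edge computing system, and the request can be directed to the load-balancing device. After receiving the website detection request, the load-balancing device can forward it to the target cloud platform according to the running status of the multiple cloud platforms. The target cloud platform may be chosen by selecting the cloud platform with the lowest load, by selecting the cloud platform with the best performance, or according to other selection principles; this embodiment imposes no limitation.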
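Of the selection principles mentioned, the lowest-load rule is the simplest to illustrate. The sketch below assumes the running status has already been reduced to a single load figure and a health flag per platform; `CloudPlatform`, `load`, `healthy`, and `pick_cloud_platform` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class CloudPlatform:
    name: str
    load: float           # hypothetical utilisation figure, lower is better
    healthy: bool = True  # hypothetical flag derived from the running status


def pick_cloud_platform(platforms):
    """Return the healthy cloud platform with the lowest current load."""
    healthy = [p for p in platforms if p.healthy]
    if not healthy:
        raise RuntimeError("no healthy cloud platform available")
    return min(healthy, key=lambda p: p.load)


platforms = [
    CloudPlatform("cloud-1", load=0.72),
    CloudPlatform("cloud-2", load=0.15, healthy=False),
    CloudPlatform("cloud-3", load=0.40),
]
chosen = pick_cloud_platform(platforms)
```

Swapping the `key` function is enough to express another principle, such as preferring the platform with the best measured performance.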
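In the embodiments of the present application, the cloud platform receives a website detection request carrying a target URL, and forwards the website detection request to a target edge device corresponding to the target URL; the target edge device obtains a page screenshot corresponding to the target URL, analyzes the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generates an analysis result; the target edge device feeds the analysis result back to the sender of the website detection request. In this way, when a website needs to be detected, the detection can be performed by distributed edge devices using machine algorithms. Compared with centralized manual detection, this effectively reduces detection cost, improves detection efficiency, and reduces the central load and detection pressure; in addition, because the edge device is close to the origin site of the website, bandwidth consumption is reduced and detection delay is shortened.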
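Based on the same technical concept, an embodiment of the present application also provides a system for website detection. The system includes a cloud platform and multiple edge devices deployed in a distributed manner, wherein:
the cloud platform is configured to receive a website detection request carrying a target URL, and forward the website detection request to a target edge device corresponding to the target URL;
the target edge device is configured to obtain a page screenshot corresponding to the target URL, analyze the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generate an analysis result;
the target edge device is configured to feed the analysis result back to the sender of the website detection request.
For example, the target edge device is specifically configured to:
recognize the text in the page screenshot based on OCR technology, compare the recognized text with a violation text library based on the AC automaton algorithm, and generate a text analysis result; and/or
detect, based on the image analysis model, whether the page screenshot contains a violating image, and generate an image analysis result.
For example, the target edge device is further configured to:
train the image analysis model according to the image analysis result, so as to update the model parameters of the image analysis model.
For example, the target edge device is specifically configured to:
if a result confirmation message sent by the sender is received, train the image analysis model according to the image analysis result; otherwise, discard the image analysis result.
For example, the target edge device is further configured to:
before training the image analysis model according to the image analysis result, check the image analysis result based on a preset image information detection algorithm, and adjust the image analysis result according to the check result; or
before training the image analysis model according to the image analysis result, receive a manual adjustment instruction for the image analysis result, and adjust the image analysis result according to the manual adjustment instruction.
For example, the target edge device is further configured to periodically send the model parameters of the image analysis model to the cloud platform;
the cloud platform is further configured to periodically update the model parameters of the image analysis model corresponding to each edge device, based on the model parameters most recently uploaded by each edge device, and to feed the corresponding updated model parameters of the image analysis model back to each edge device.
For example, the system includes a load-balancing device and multiple cloud platforms;
the load-balancing device is configured to receive a website detection request carrying a target URL, and forward the website detection request to a target cloud platform according to the running status of the multiple cloud platforms.
In the embodiments of the present application, the cloud platform receives a website detection request carrying a target URL, and forwards the website detection request to a target edge device corresponding to the target URL; the target edge device obtains a page screenshot corresponding to the target URL, analyzes the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generates an analysis result; the target edge device feeds the analysis result back to the sender of the website detection request. In this way, when a website needs to be detected, the detection can be performed by distributed edge devices using machine algorithms. Compared with centralized manual detection, this effectively reduces detection cost, improves detection efficiency, and reduces the central load and detection pressure; in addition, because the edge device is close to the origin site of the website, bandwidth consumption is reduced and detection delay is shortened.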
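Fig. 4 is a schematic structural diagram of a network device provided by an embodiment of the present application. The network device 400 may vary considerably depending on configuration or performance, and may include one or more central processing units 422 (for example, one or more processors), a memory 432, and one or more storage media 430 (for example, one or more mass storage devices) storing application programs 442 or data 444. The memory 432 and the storage medium 430 may provide transient or persistent storage. The program stored in the storage medium 430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the network device 400. In addition, the central processing unit 422 may be configured to communicate with the storage medium 430 and execute, on the network device 400, the series of instruction operations in the storage medium 430.
The network device 400 may also include one or more power supplies 429, one or more wired or wireless network interfaces 450, one or more input/output interfaces 458, one or more keyboards 456, and/or one or more operating systems 441, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
The network device 400 may include a memory and one or more programs, where the one or more programs are stored in the memory and configured to be executed by one or more processors, and the one or more programs contain instructions for performing the edge device's processing in the website detection described above.
A person of ordinary skill in the art can understand that all or some of the steps of the above embodiments can be implemented by hardware, or by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are only some embodiments of the present application and are not intended to limit the present application. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present application shall fall within the protection scope of the present application.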

Claims (16)

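  1. A method for website detection, the method being applied to an edge computing system that includes a cloud platform and multiple edge devices deployed in a distributed manner, wherein:
    the cloud platform receives a website detection request carrying a target URL, and forwards the website detection request to a target edge device corresponding to the target URL;
    the target edge device obtains a page screenshot corresponding to the target URL, analyzes the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generates an analysis result;
    the target edge device feeds the analysis result back to the sender of the website detection request.
  2. The method according to claim 1, wherein analyzing the page screenshot based on the preset text recognition algorithm and/or image analysis model and generating the analysis result comprises:
    the target edge device recognizing the text in the page screenshot based on OCR technology, comparing the recognized text with a violation text library based on the AC automaton algorithm, and generating a text analysis result; and/or
    the target edge device detecting, based on the image analysis model, whether the page screenshot contains a violating image, and generating an image analysis result.
  3. The method according to claim 2, wherein the method further comprises:
    the target edge device training the image analysis model according to the image analysis result, so as to update the model parameters of the image analysis model.
  4. The method according to claim 3, wherein the target edge device training the image analysis model according to the image analysis result comprises:
    if a result confirmation message sent by the sender is received, the target edge device training the image analysis model according to the image analysis result; otherwise, discarding the image analysis result.
  5. The method according to claim 3, wherein, before the target edge device trains the image analysis model according to the image analysis result, the method further comprises:
    the target edge device checking the image analysis result based on a preset image information detection algorithm, and adjusting the image analysis result according to the check result; or
    the target edge device receiving a manual adjustment instruction for the image analysis result, and adjusting the image analysis result according to the manual adjustment instruction.
  6. The method according to claim 3, wherein the method further comprises:
    the target edge device periodically sending the model parameters of the image analysis model to the cloud platform;
    the cloud platform periodically updating the model parameters of the image analysis model corresponding to each edge device, based on the model parameters most recently uploaded by each edge device;
    the cloud platform feeding the corresponding updated model parameters of the image analysis model back to each edge device.
  7. The method according to claim 1, wherein the edge computing system includes a load-balancing device and multiple cloud platforms;
    before the cloud platform receives the website detection request carrying the target URL, the method further comprises:
    the load-balancing device receiving the website detection request carrying the target URL, and forwarding the website detection request to a target cloud platform according to the running status of the multiple cloud platforms.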
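  8. A system for website detection, the system including a cloud platform and multiple edge devices deployed in a distributed manner, wherein:
    the cloud platform is configured to receive a website detection request carrying a target URL, and forward the website detection request to a target edge device corresponding to the target URL;
    the target edge device is configured to obtain a page screenshot corresponding to the target URL, analyze the page screenshot based on a preset text recognition algorithm and/or image analysis model, and generate an analysis result;
    the target edge device is configured to feed the analysis result back to the sender of the website detection request.
  9. The system according to claim 8, wherein the target edge device is specifically configured to:
    recognize the text in the page screenshot based on OCR technology, compare the recognized text with a violation text library based on the AC automaton algorithm, and generate a text analysis result; and/or
    detect, based on the image analysis model, whether the page screenshot contains a violating image, and generate an image analysis result.
  10. The system according to claim 9, wherein the target edge device is further configured to:
    train the image analysis model according to the image analysis result, so as to update the model parameters of the image analysis model.
  11. The system according to claim 10, wherein the target edge device is specifically configured to:
    if a result confirmation message sent by the sender is received, train the image analysis model according to the image analysis result; otherwise, discard the image analysis result.
  12. The system according to claim 10, wherein the target edge device is further configured to:
    before training the image analysis model according to the image analysis result, check the image analysis result based on a preset image information detection algorithm, and adjust the image analysis result according to the check result; or
    before training the image analysis model according to the image analysis result, receive a manual adjustment instruction for the image analysis result, and adjust the image analysis result according to the manual adjustment instruction.
  13. The system according to claim 10, wherein the target edge device is further configured to periodically send the model parameters of the image analysis model to the cloud platform;
    the cloud platform is further configured to periodically update the model parameters of the image analysis model corresponding to each edge device, based on the model parameters most recently uploaded by each edge device, and to feed the corresponding updated model parameters of the image analysis model back to each edge device.
  14. The system according to claim 8, wherein the system includes a load-balancing device and multiple cloud platforms;
    the load-balancing device is configured to receive a website detection request carrying a target URL, and forward the website detection request to a target cloud platform according to the running status of the multiple cloud platforms.
  15. A network device, the network device including a processor and a memory, wherein the memory stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the processing of the edge device in the method for website detection according to any one of claims 1 to 7.
  16. A computer-readable storage medium, wherein the storage medium stores at least one instruction, at least one program, a code set, or an instruction set, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the processing of the edge device in the method for website detection according to any one of claims 1 to 7.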
PCT/CN2019/096173 2019-05-29 2019-07-16 一种网站探测的方法和系统 WO2020237799A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP19917522.5A EP3771171A4 (en) 2019-05-29 2019-07-16 WEBSITE DETECTION METHOD AND SYSTEM
US17/028,807 US20210004628A1 (en) 2019-05-29 2020-09-22 Method and system for website detection

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910457676.6 2019-05-29
CN201910457676.6A CN110336790B (zh) 2019-05-29 2019-05-29 一种网站检测的方法和系统

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/028,807 Continuation US20210004628A1 (en) 2019-05-29 2020-09-22 Method and system for website detection

Publications (1)

Publication Number Publication Date
WO2020237799A1 true WO2020237799A1 (zh) 2020-12-03

Family

ID=68140584

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/096173 WO2020237799A1 (zh) 2019-05-29 2019-07-16 一种网站探测的方法和系统

Country Status (4)

Country Link
US (1) US20210004628A1 (zh)
EP (1) EP3771171A4 (zh)
CN (1) CN110336790B (zh)
WO (1) WO2020237799A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117541269A (zh) * 2023-12-08 2024-02-09 北京中数睿智科技有限公司 基于智能大模型的第三方模块数据实时监控方法及系统

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111368529B (zh) * 2020-03-17 2022-07-01 重庆邮电大学 基于边缘计算的移动终端敏感词识别方法、装置及系统
CN111783159A (zh) * 2020-07-07 2020-10-16 杭州安恒信息技术股份有限公司 网页篡改的验证方法、装置、计算机设备和存储介质
CN112565250B (zh) * 2020-12-04 2022-12-06 中国移动通信集团内蒙古有限公司 一种网站识别方法、装置、设备及存储介质
CN113688346A (zh) * 2021-08-16 2021-11-23 杭州安恒信息技术股份有限公司 一种违法网站识别方法、装置、设备及存储介质
CN114598623B (zh) * 2022-03-04 2024-04-05 北京沃东天骏信息技术有限公司 测试任务管理方法、装置、电子设备和存储介质
CN115277566B (zh) * 2022-05-20 2024-03-22 鸬鹚科技(深圳)有限公司 数据访问的负载均衡方法、装置、计算机设备及介质
CN115277694B (zh) * 2022-06-29 2023-12-08 北京奇艺世纪科技有限公司 一种数据获取方法、装置、系统、电子设备及存储介质
US11790031B1 (en) * 2022-10-31 2023-10-17 Content Square SAS Website change detection

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20040158429A1 (en) * 2003-02-10 2004-08-12 Bary Emad Abdel Method and system for classifying content and prioritizing web site content issues
CN103685575A (zh) * 2014-01-06 2014-03-26 洪高颖 一种基于云架构的网站安全监控方法
CN103902889A (zh) * 2012-12-26 2014-07-02 腾讯科技(深圳)有限公司 一种恶意消息云检测方法和服务器
CN106657228A (zh) * 2016-09-27 2017-05-10 山东浪潮云服务信息科技有限公司 一种利用云端进行并发采集的爬虫实现方法
CN106874487A (zh) * 2017-02-21 2017-06-20 国信优易数据有限公司 一种分布式爬虫管理系统及其方法

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050097189A1 (en) * 2003-10-30 2005-05-05 Avaya Technology Corp. Automatic detection and dialing of phone numbers on web pages
US9282117B2 (en) * 2012-07-24 2016-03-08 Webroot Inc. System and method to provide automatic classification of phishing sites
CN102938716B (zh) * 2012-12-06 2016-06-01 网宿科技股份有限公司 内容分发网络加速测试方法和装置
CN106951484B (zh) * 2017-03-10 2020-10-30 百度在线网络技术(北京)有限公司 图片检索方法及装置、计算机设备及计算机可读介质
CN108574685B (zh) * 2017-03-14 2021-08-03 华为技术有限公司 一种流媒体推送方法、装置及系统
CN106888270B (zh) * 2017-03-30 2020-06-23 网宿科技股份有限公司 回源选路调度的方法和系统
US10601866B2 (en) * 2017-08-23 2020-03-24 International Business Machines Corporation Discovering website phishing attacks
CN107911360A (zh) * 2017-11-13 2018-04-13 哈尔滨工业大学(威海) 一种被黑网站检测方法及系统
CN108197465B (zh) * 2017-11-28 2020-12-08 中国科学院声学研究所 一种网址检测方法及装置
CN108768982B (zh) * 2018-05-17 2021-04-27 江苏通付盾信息安全技术有限公司 钓鱼网站的检测方法、装置、计算设备及计算机存储介质
CN108965245B (zh) * 2018-05-31 2021-04-13 国家计算机网络与信息安全管理中心 基于自适应异构多分类模型的钓鱼网站检测方法和系统
CN109255356B (zh) * 2018-07-24 2022-02-01 创新先进技术有限公司 一种文字识别方法、装置及计算机可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP3771171A4 *

Also Published As

Publication number Publication date
CN110336790A (zh) 2019-10-15
CN110336790B (zh) 2021-05-25
EP3771171A4 (en) 2021-06-02
EP3771171A1 (en) 2021-01-27
US20210004628A1 (en) 2021-01-07

Similar Documents

Publication Publication Date Title
WO2020237799A1 (zh) 一种网站探测的方法和系统
US11122067B2 (en) Methods for detecting and mitigating malicious network behavior and devices thereof
US10812358B2 (en) Performance-based content delivery
US11381629B2 (en) Passive detection of forged web browsers
US10027739B1 (en) Performance-based content delivery
US9432389B1 (en) System, apparatus and method for detecting a malicious attack based on static analysis of a multi-flow object
WO2018121331A1 (zh) 攻击请求的确定方法、装置及服务器
EP4060958B1 (en) Attack behavior detection method and apparatus, and attack detection device
US9350757B1 (en) Detecting computer security threats in electronic documents based on structure
EP2755157A1 (en) Detecting undesirable content
US10764311B2 (en) Unsupervised classification of web traffic users
US20220398292A1 (en) Technologies for cross-device shared web resource cache
CN112565226A (zh) 请求处理方法、装置、设备及系统和用户画像生成方法
CN109450844B (zh) 触发漏洞检测的方法及装置
CN112637235A (zh) 一种通信方法、装置、设备及介质
WO2011103835A2 (zh) 用户访问的控制方法、装置及系统
US11445003B1 (en) Systems and methods for autonomous program detection
CN113778709A (zh) 接口调用方法、装置、服务器及存储介质
CN112804201A (zh) 一种获取设备信息的方法及装置
US10810302B2 (en) Database access monitoring with selective session information retrieval
CN112637171A (zh) 数据流量处理方法、装置、设备、系统和存储介质
US20200153794A1 (en) Database firewall for use by an application using a database connection pool
US11528289B2 (en) Security mechanisms for content delivery networks
US11755397B2 (en) Systems and methods for processing of messages subject to dead letter queues in representational state transfer architectures to prevent data loss in cloud-based computing environments
US20230319106A1 (en) Machine learning uniform resource locator (url) classifier

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 2019917522

Country of ref document: EP

Effective date: 20200907

NENP Non-entry into the national phase

Ref country code: DE