WO2019127658A1 - Method and system for identifying malicious image on the basis of url paths of similar images - Google Patents

Info

Publication number
WO2019127658A1
Authority
WO
WIPO (PCT)
Prior art keywords
address
url
database
weighting factor
picture
Prior art date
Application number
PCT/CN2018/072242
Other languages
French (fr)
Chinese (zh)
Inventor
蔡昭权
胡松
胡辉
蔡映雪
陈伽
黄翰
梁椅辉
罗伟
黄思博
Original Assignee
惠州学院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 惠州学院
Publication of WO2019127658A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N7/00 Television systems
    • H04N7/24 Systems for the transmission of television signals using pulse code modulation

Definitions

  • The present disclosure pertains to the field of information security and relates, for example, to a method and a system for identifying harmful pictures.
  • Current techniques for identifying harmful pictures on the network fall into two major categories. One is the traditional approach, which relies mainly on various classifiers.
  • The other is deep learning, in particular the application of convolutional neural networks.
  • Both approaches fall short in recognition efficiency.
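  • The present disclosure provides a method for identifying a harmful picture based on the URL paths of approximate images, including:
  • Step a) when it is determined that a page element of a webpage includes the URL path of a picture, identifying the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identifying the user ID recorded in the page content of the webpage; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the IP address query result and/or the ID query result;
  • Step b) obtaining, according to the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; based on the domain name, performing a whois query in a second database, and/or based on the IP address, querying the second database for that IP address or an IP address in the same network segment; and outputting a second weighting factor according to the whois query result and/or the IP address query result;
  • Step c) inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all approximate images of the picture, and obtaining the URL paths of all approximate images; obtaining, based on those URL paths, the domain names contained in the URLs of the approximate images and/or the IP addresses to which those URLs point; based on those domain names, performing whois queries in the second database, and/or based on those IP addresses, querying the second database for those IP addresses or IP addresses in the same network segment; and outputting a third weighting factor according to the whois query results and/or the IP address query results;
  • Step d) combining the first, second and third weighting factors to identify whether the picture is a harmful picture.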
  • The present disclosure also discloses a system for identifying harmful pictures based on the URL paths of approximate images, including:
  • a first weighting factor generating module configured to: when it is determined that a page element of a webpage includes the URL path of a picture, identify the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identify the user ID recorded in the page content of the webpage; query a first database for the IP address or an IP address in the same network segment, and/or query the first database for the ID; and output a first weighting factor according to the IP address query result and/or the ID query result;
  • a second weighting factor generating module configured to: obtain, according to the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; based on the domain name, perform a whois query in a second database, and/or based on the IP address, query the second database for that IP address or an IP address in the same network segment; and output a second weighting factor according to the whois query result and/or the IP address query result;
  • a third weighting factor generating module configured to: input the URL path of the picture into a third-party picture database, search the third-party picture database for all approximate images of the picture, and obtain the URL paths of all approximate images; obtain, based on those URL paths, the domain names contained in the URLs of the approximate images and/or the IP addresses to which those URLs point; based on those domain names, perform whois queries in the second database, and/or based on those IP addresses, query the second database for those IP addresses or IP addresses in the same network segment; and output a third weighting factor according to the whois query results and/or the IP address query results;
  • an identifying module configured to combine the first, second and third weighting factors to identify whether the picture is a harmful picture.
  • By drawing on databases built from big data, the present disclosure can identify harmful pictures more efficiently and without much image processing.
  • FIG. 1 is a schematic diagram of a method in accordance with an embodiment of the present disclosure.
  • FIG. 2 is a schematic diagram of a system in accordance with an embodiment of the present disclosure.
  • References to "an embodiment" herein mean that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure.
  • Appearances of the phrase in various places in the specification do not necessarily refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will appreciate that the embodiments described herein can be combined with other embodiments.
  • FIG. 1 is a schematic flowchart of a method for identifying a harmful picture based on the URL paths of approximate images according to an embodiment of the present disclosure. As shown, the method includes:
  • Step S100: when it is determined that a page element of a webpage includes the URL path of a picture, identify the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identify the user ID recorded in the page content of the webpage; query the first database for the IP address or an IP address in the same network segment, and/or query the first database for the ID; and output a first weighting factor according to the IP address query result and/or the ID query result.
  • The first database maintains known IP addresses or IP address segments of users who have posted harmful pictures, and a list of IDs of users who have posted harmful pictures.
  • For example, suppose the IP address of the user recorded in the content of the web page is 192.168.10.3:
  • the first weighting factor may be, for example, 1.0;
  • if the only IP address recorded in the database is 192.168.10.4, then 192.168.10.3 is moderately suspected of being an alternate or newly changed address of a user who has posted harmful pictures, and the first weighting factor may be, for example, 0.6;
  • the first weighting factor may be, for example, 0.9;
  • the first weighting factor may be, for example, 0.4.
  • the first weighting factor may be, for example, 1.0;
  • the first weighting factor may be, for example, 0.
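The graded IP lookup in the examples above can be sketched in Python. This is a minimal illustration, not the patent's implementation: the function name `ip_query_factor`, the use of /24 and /16 prefixes to stand for "same network segment" and "192.168.X.X", and the exact conditions attached to the 0.9 and 0.4 scores are assumptions made here for illustration.

```python
import ipaddress


def ip_query_factor(user_ip, known_ips):
    """Return an illustrative score in [0, 1] for user_ip against known_ips.

    known_ips stands in for the IP list of the first database (hypothetical).
    """
    ip = ipaddress.ip_address(user_ip)
    known = [ipaddress.ip_address(k) for k in known_ips]
    if ip in known:
        return 1.0  # the exact address is already recorded

    # Assumption: "same network segment" is modelled as the enclosing /24,
    # and the wider 192.168.X.X range as the enclosing /16.
    net24 = ipaddress.ip_network(f"{user_ip}/24", strict=False)
    net16 = ipaddress.ip_network(f"{user_ip}/16", strict=False)
    same24 = sum(1 for k in known if k in net24)
    same16 = sum(1 for k in known if k in net16)

    if same24 >= 2:
        return 0.9  # assumed condition: several recorded addresses in the /24
    if same24 == 1:
        return 0.6  # one neighbouring address, as in the 192.168.10.4 example
    if same16 >= 2:
        return 0.4  # assumed condition: only the wider range matches
    return 0.0
```

A real deployment would take the prefix lengths and scores from configuration rather than hard-coding them.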
  • The above step may also combine the IP and ID queries, that is, check whether the IP address and ID of the user who posted or discussed the picture match an IP address (or IP address segment) and/or ID already recorded in the first database.
  • Let the user's IP query factor be $u$, the ID query factor be $v$, and the first weighting factor be $x$, where $0 \le u \le 1$, $0 \le v \le 1$, and $0 \le x \le 1$. The first weighting factor can then be determined according to the following formula:
  • $x = d \cdot u + e \cdot v$
  • where the coefficients $d$ and $e$ are not equal and may be adjusted according to the weight of each query factor and the actual conditions under which the first weighting factor is determined.
  • The above formula for calculating $x$ is linear, but in practice a nonlinear formula may also be used.
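As a worked illustration of the linear combination, assume the form $x = d \cdot u + e \cdot v$ that the surrounding text implies, with hypothetical coefficients $d = 0.6$ and $e = 0.4$. These values are not given in the source; they are chosen here only because $d + e = 1$ keeps $x$ within $[0, 1]$. For an IP query factor $u = 0.6$ and an ID query factor $v = 1.0$:

```latex
\begin{aligned}
x &= d \cdot u + e \cdot v \\
  &= 0.6 \times 0.6 + 0.4 \times 1.0 \\
  &= 0.36 + 0.40 = 0.76
\end{aligned}
```

With these assumed coefficients, a confirmed ID match pulls the result well above the IP-only score of 0.6.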
  • Step S200: obtain, according to the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; based on the domain name, perform a whois query in the second database, and/or based on the IP address, query the second database for that IP address or an IP address in the same network segment; and output a second weighting factor according to the whois query result and/or the IP address query result.
  • The second database maintains a list of known domain names that have posted harmful pictures and/or a list of known IP addresses and IP address segments of websites that have posted harmful pictures. In contrast to the previous step, posting here refers to the website, identified by its IP address and/or domain name, rather than to an individual user.
  • The whois query examines the association between domain name registrants and harmful pictures.
  • For example, the second database can maintain the following information: domain names, information on domain name registrants who publish large numbers of pornographic, reactionary, or cult pictures on the Internet, and the corresponding harmful pictures.
  • the second weighting factor may be, for example, 1.0;
  • if the second database records no identifier of any harmful picture for the domain name www.a.com, but a query yields the registrant of that domain name and the domain names of other websites registered by the same registrant, and the second database shows that those other websites publish large numbers of harmful pictures on the Internet, then the website at www.a.com is still highly suspected of being a source of harmful pictures even though no harmful picture is recorded for it, and the second weighting factor may be, for example, 0.9;
  • if the second database records no identifier of any harmful picture for the domain name www.a.com, and a query yields the registrant of that domain name and the domain names of other websites registered by the same registrant, but the second database contains no identifier showing that those other websites publish harmful pictures, the second weighting factor may be, for example, 0;
  • the second weighting factor may also be, for example, 0.
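The registrant-based reasoning in the domain examples above can be sketched as follows. This is a minimal sketch under stated assumptions: `HARMFUL_DOMAINS` and `REGISTRANT_OF` are hypothetical in-memory stand-ins for the second database, the domain names other than www.a.com are invented for illustration, and no real whois service is contacted.

```python
from urllib.parse import urlsplit

# Hypothetical stand-ins for the second database (not from the source).
HARMFUL_DOMAINS = {"www.b.com"}
REGISTRANT_OF = {
    "www.a.com": "registrant-1",
    "www.b.com": "registrant-1",
    "www.c.com": "registrant-2",
}


def whois_query_factor(image_url):
    """Score the domain in image_url using registrant associations."""
    host = urlsplit(image_url).hostname
    if host in HARMFUL_DOMAINS:
        return 1.0  # the domain itself is recorded as publishing harmful pictures

    registrant = REGISTRANT_OF.get(host)
    if registrant is not None:
        # Other domains registered by the same registrant.
        siblings = {
            domain
            for domain, owner in REGISTRANT_OF.items()
            if owner == registrant and domain != host
        }
        if siblings & HARMFUL_DOMAINS:
            return 0.9  # same registrant runs a recorded harmful site
    return 0.0  # no record for the domain or its registrant's other sites
```

For instance, `whois_query_factor("http://www.a.com/img/1.jpg")` evaluates the www.a.com case described above, where only a sibling domain of the same registrant is recorded.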
  • Alternatively, the IP address to which the URL points may be obtained from the URL path of the picture, and an IP address or IP address segment query performed to output the second weighting factor.
  • For example, suppose the IP address is 192.168.20.3:
  • the second weighting factor may be, for example, 1.0;
  • if the only IP address recorded in the second database is 192.168.20.4, then 192.168.20.3 is slightly suspected of being an alternate or newly changed address of the website to which the picture belongs, and the second weighting factor may be, for example, 0.6;
  • the second weighting factor may be, for example, 0.9;
  • if the IP addresses recorded in the database include several addresses in the 192.168.X.X range but none in the 192.168.20.X network segment, then 192.168.20.3 is only cautiously suspected of being the address of a website hosting harmful pictures, and the second weighting factor may be, for example, 0.4.
  • The above step may also consider the IP list and the domain name list together, that is, the second weighting factor is determined jointly by the IP query on the picture URL and the domain name whois query.
  • Let the IP query factor of the picture URL be $i$, the domain name whois query factor be $j$, and the second weighting factor be $y$, where $0 \le i \le 1$, $0 \le j \le 1$, and $0 \le y \le 1$. The second weighting factor can then be determined according to the following formula:
  • $y = m \cdot i + n \cdot j$
  • where the coefficients $m$ and $n$ are not equal and may be adjusted according to the weight of each query factor and the actual conditions under which the second weighting factor is determined.
  • The above formula for calculating $y$ is linear, but in practice a nonlinear formula may also be used.
  • Step S300: input the URL path of the picture into a third-party picture database, search the third-party picture database for all approximate images of the picture, and obtain the URL paths of all approximate images; obtain, based on those URL paths, the domain names contained in the URLs of the approximate images and/or the IP addresses to which those URLs point; based on those domain names, perform whois queries in the second database, and/or based on those IP addresses, query the second database for those IP addresses or IP addresses in the same network segment; and output a third weighting factor according to the whois query results and/or the IP address query results.
  • In other words, step S300 performs a search-by-image query in the third-party picture database and outputs a third weighting factor according to IP and/or domain name whois queries on the URL paths of the approximate images in the search results.
  • The third weighting factor is determined by how the IP addresses and/or domain names in the URL paths of the approximate images fare when queried in the second database, for example by counting how often those IP addresses or the whois information of those domain names occur in the second database. When the number of occurrences satisfies a corresponding threshold condition, the third weighting factor may be 1.0, 0.8, or 0.4, depending on the specific threshold condition.
  • Note that step S300 itself still involves little image processing or recognition.
  • Image processing is performed by the third-party picture database, so the present disclosure need not involve much image processing itself.
  • Take a third-party picture database such as www.tineye.com as an example.
  • Suppose the picture is indeed a pornographic picture, many approximate images of it are found in a database such as www.tineye.com, and the domain names and/or IP addresses in the URLs of those approximate images are also recorded in the second database.
  • Step S300 can then give a third weighting factor, which may be 1.0 or may be 0.6. If the domain names and/or IP addresses in the URLs of all retrieved approximate images are recorded in the second database, the third weighting factor is likely to be 1.0. That is, step S300 amounts to scoring the domain names and/or IP addresses in the URLs of the approximate images to determine whether they have a prior record; if a considerable number of them do, there is reason to strongly suspect that the picture is a harmful picture.
  • Furthermore, step S300 does not exclude prior-art means of identifying harmful content in a picture; that is, step S300 can be combined with conventional image processing or with a deep learning model to identify harmful pictures.
  • Whether the third-party picture database performs the approximate image search based on content or by other means is not limited by the present disclosure.
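The scoring of approximate-image URLs described for step S300 can be sketched as below. This is an illustration under stated assumptions, not the patent's method: the list of similar-image URLs is taken as already returned by some third-party search, `flagged_hosts` is a hypothetical stand-in for the second database, and the ratio thresholds behind the 1.0, 0.8, and 0.4 scores are invented here, since the source only says that threshold conditions exist.

```python
from urllib.parse import urlsplit


def third_weighting_factor(similar_urls, flagged_hosts):
    """Score a picture by how many approximate-image hosts have a prior record."""
    hosts = [urlsplit(url).hostname for url in similar_urls]
    if not hosts:
        return 0.0  # no approximate images were found

    # Fraction of approximate-image hosts recorded in the (hypothetical) database.
    ratio = sum(1 for host in hosts if host in flagged_hosts) / len(hosts)

    # Hypothetical threshold conditions.
    if ratio == 1.0:
        return 1.0  # every approximate image sits on a recorded host
    if ratio >= 0.5:
        return 0.8
    if ratio > 0.0:
        return 0.4
    return 0.0
```

Only URL strings and set lookups are involved, which reflects the point above that the picture itself is never decoded or analysed locally.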
  • Step S400: combine the first, second and third weighting factors to identify whether the picture is a harmful picture.
  • Let the first weighting factor be $x$, the second weighting factor be $y$, and the third weighting factor be $z$, where $0 \le x \le 1$, $0 \le y \le 1$, and $0 \le z \le 1$. The harmful coefficient $W$ of the picture can be calculated from these weighting factors according to the following formula:
  • $W = a \cdot x + b \cdot y + c \cdot z$
  • where the coefficients $a$, $b$, and $c$ are not equal and may be adjusted according to each weighting factor and the actual conditions of identifying harmful content.
  • The above formula for calculating $W$ is linear, but in practice a nonlinear formula may also be used.
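The fusion in step S400 can be sketched as a weighted sum followed by a decision. This is a minimal sketch, assuming the linear form W = a·x + b·y + c·z implied by the text; the coefficient values and the decision threshold are hypothetical, as the source specifies neither.

```python
def harm_coefficient(x, y, z, a=0.2, b=0.3, c=0.5):
    """Combine the three weighting factors linearly.

    The coefficients a, b, c are hypothetical; they are unequal and sum to 1
    so that the result stays within [0, 1].
    """
    for value in (x, y, z):
        if not 0.0 <= value <= 1.0:
            raise ValueError("weighting factors must lie in [0, 1]")
    return a * x + b * y + c * z


def is_harmful(x, y, z, threshold=0.6):
    """Flag the picture when W reaches an assumed decision threshold."""
    return harm_coefficient(x, y, z) >= threshold
```

Giving the largest coefficient to the third factor is only one possible choice; in practice the weights would be tuned against labelled data.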
  • Step S400 thus integrates (fuses) multiple weighting factors to identify harmful pictures.
  • Those skilled in the art know that image processing and recognition are relatively time-consuming, whereas queries are comparatively fast. The above embodiment therefore provides an efficient method of identifying harmful pictures. In addition, the above embodiment can further integrate and update the first database, the second database, and other databases in conjunction with big data and/or artificial intelligence.
  • the second database is a third party database.
  • The IP address information of publishers of harmful pictures recorded on websites is collected, and the first database is updated accordingly.
  • Harmful pictures generally attract loyal users. Some of these users take part in spreading harmful pictures, and most of their IP addresses are relatively fixed. If the relevant website itself records the IP address information of publishers of harmful pictures, the present disclosure updates the first database by collecting that IP address information.
  • Step S200 further includes:
  • querying the security of the domain name in a third-party domain name security list to output a security factor, and correcting the domain-name-related second weighting factor by the security factor.
  • For example, virustotal.com is a third-party domain name security screening website. If the third-party information indicates that the domain name hosts a virus or a Trojan, the second weighting factor should be raised, because the related website is all the more insecure.
  • This embodiment corrects the second weighting factor from a network security perspective so as to protect the user from other losses. Network security bears on users' privacy and property rights: if a website associated with harmful pictures also poses network security risks, it can damage users' property or privacy in addition to exposing them to the harmful pictures.
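One way to raise the second weighting factor with a security factor is sketched below. The source does not specify a correction rule; the update `y + s * (1 - y)` and the convention that the security factor `s` lies in [0, 1], with larger values meaning a less secure domain, are assumptions made here for illustration.

```python
def correct_second_factor(y, security_factor):
    """Raise y toward 1 in proportion to an assumed insecurity score.

    security_factor = 0 leaves y unchanged; security_factor = 1 pushes it to 1.
    """
    if not (0.0 <= y <= 1.0 and 0.0 <= security_factor <= 1.0):
        raise ValueError("both inputs must lie in [0, 1]")
    # Close a fraction of the remaining gap to 1, capped to stay within range.
    return min(1.0, y + security_factor * (1.0 - y))
```

This form never lowers the factor and never exceeds 1, which matches the stated intent that a virus or Trojan report should only raise the score.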
  • Step S400 further includes: when the picture is identified as harmful, submitting the picture to the third-party picture database, so that the third-party picture database can consider whether to update its data.
  • Step S300 further includes the following:
  • Step c1) crawling audio in the webpage;
  • Step c2) identifying whether the audio includes harmful content and, if so, correcting the third weighting factor, for example by increasing it.
  • In this way, the present disclosure combines multiple dimensions and modalities (IP information, domain name information, image information, and audio information) to identify harmful pictures quickly.
  • the above embodiment may be implemented on the router side or the network provider side to filter related pictures in advance.
  • A system for identifying harmful pictures based on the URL paths of approximate images, including:
  • a first weighting factor generating module configured to: when it is determined that a page element of a webpage includes the URL path of a picture, identify the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identify the user ID recorded in the page content of the webpage; query a first database for the IP address or an IP address in the same network segment, and/or query the first database for the ID; and output a first weighting factor according to the IP address query result and/or the ID query result;
  • a second weighting factor generating module configured to: obtain, according to the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; based on the domain name, perform a whois query in a second database, and/or based on the IP address, query the second database for that IP address or an IP address in the same network segment; and output a second weighting factor according to the whois query result and/or the IP address query result;
  • a third weighting factor generating module configured to: input the URL path of the picture into a third-party picture database, search the third-party picture database for all approximate images of the picture, and obtain the URL paths of all approximate images; obtain, based on those URL paths, the domain names contained in the URLs of the approximate images and/or the IP addresses to which those URLs point; based on those domain names, perform whois queries in the second database, and/or based on those IP addresses, query the second database for those IP addresses or IP addresses in the same network segment; and output a third weighting factor according to the whois query results and/or the IP address query results;
  • an identification module configured to combine the first, second and third weighting factors to identify whether the picture is a harmful picture.
  • the second database is a third party database.
  • the second weighting factor generating module further includes:
  • a correction unit configured to: further query the security of the domain name in the third-party domain name security list to output a security factor, and modify the second weighting factor related to the domain name by the security factor.
  • The identification module is further configured to: when the picture is identified as harmful, submit the picture to the third-party picture database.
  • The third weighting factor generating module further corrects the third weighting factor by means of:
  • an audio crawling unit configured to crawl audio in the webpage;
  • an audio recognition unit configured to identify whether the audio includes harmful content and, if so, to correct the third weighting factor.
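  • In another embodiment, the present disclosure discloses a system for identifying harmful pictures, including:
  • a processor and a memory storing executable instructions, the processor executing the instructions to perform the following operations:
  • Step a) when it is determined that a page element of a webpage includes the URL path of a picture, identifying the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identifying the user ID recorded in the page content of the webpage; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the IP address query result and/or the ID query result;
  • Step b) obtaining, according to the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; based on the domain name, performing a whois query in a second database, and/or based on the IP address, querying the second database for that IP address or an IP address in the same network segment; and outputting a second weighting factor according to the whois query result and/or the IP address query result;
  • Step c) inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all approximate images of the picture, and obtaining the URL paths of all approximate images; obtaining, based on those URL paths, the domain names contained in the URLs of the approximate images and/or the IP addresses to which those URLs point; based on those domain names, performing whois queries in the second database, and/or based on those IP addresses, querying the second database for those IP addresses or IP addresses in the same network segment; and outputting a third weighting factor according to the whois query results and/or the IP address query results;
  • Step d) combining the first, second and third weighting factors to identify whether the picture is a harmful picture.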
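  • In another embodiment, the present disclosure also discloses a computer storage medium storing executable instructions for performing a method of identifying a harmful picture as follows:
  • Step a) when it is determined that a page element of a webpage includes the URL path of a picture, identifying the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identifying the user ID recorded in the page content of the webpage; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the IP address query result and/or the ID query result;
  • Step b) obtaining, according to the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; based on the domain name, performing a whois query in a second database, and/or based on the IP address, querying the second database for that IP address or an IP address in the same network segment; and outputting a second weighting factor according to the whois query result and/or the IP address query result;
  • Step c) inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all approximate images of the picture, and obtaining the URL paths of all approximate images; obtaining, based on those URL paths, the domain names contained in the URLs of the approximate images and/or the IP addresses to which those URLs point; based on those domain names, performing whois queries in the second database, and/or based on those IP addresses, querying the second database for those IP addresses or IP addresses in the same network segment; and outputting a third weighting factor according to the whois query results and/or the IP address query results;
  • Step d) combining the first, second and third weighting factors to identify whether the picture is a harmful picture.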
  • The above system may comprise at least one processor (e.g., a CPU), at least one sensor (e.g., an accelerometer, gyroscope, GPS module, or other positioning module), at least one memory, and at least one communication bus, where the communication bus provides connection and communication among the components.
  • The device may further include at least one receiver and at least one transmitter, which may be wired transmission ports or wireless devices (including, for example, antenna devices) for exchanging signaling or data with other node devices.
  • the memory may be a high speed RAM memory or a non-volatile memory such as at least one disk memory.
  • the memory may optionally be at least one storage device located remotely from the aforementioned processor.
  • a set of program code is stored in the memory, and the processor can call the code stored in the memory to perform related functions via the communication bus.
  • An embodiment of the present disclosure further provides a computer storage medium that can store a program which, when executed, performs some or all of the steps of the method for identifying a harmful picture described in the foregoing method embodiments.
  • Modules and units in the system of the embodiments of the present disclosure may be combined, divided, and deleted according to actual needs. It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions, but those skilled in the art should understand that the present invention is not limited by the described order of actions, because certain steps may be performed in other orders or concurrently in accordance with the present invention. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions, modules, and units involved are not necessarily required by the present invention.
  • the disclosed system can be implemented in other manners.
  • the embodiments described above are merely illustrative.
  • For example, the division into units is only a division by logical function; in actual implementation there may be other ways of dividing, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the coupling or direct coupling or communication connection of the various units or components to each other may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separate, may be located in one place, or may be distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, or each unit may exist separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as a standalone product, may be stored in a computer readable storage medium.
  • Based on this understanding, the part of the technical solution of the present disclosure that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a number of instructions that cause a computer device (which may be a smart phone, a personal digital assistant, a wearable device, a laptop, or a tablet) to perform all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • The foregoing storage medium includes a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A method and system for identifying a malicious image on the basis of URL paths of similar images. The method comprises: when a page element of a webpage is determined to comprise a URL path of an image, acquiring a user IP and/or ID recorded in the page content of the webpage, acquiring, on the basis of the URL path of the image, a domain name that the URL comprises or an IP address to which the URL points, and outputting a first weight factor and a second weight factor on the basis of queries related to the user ID, the IP, and the domain name; also, searching a third-party image database for all similar images of the image and acquiring the URL paths of the similar images, querying the domains/IPs that the URLs of all of the similar images comprise, and outputting a third weight factor; and, combining the first weight factor, the second weight factor, and the third weight factor, identifying whether the image is a malicious image.

Description

Method and system for identifying harmful pictures based on URL paths of similar pictures
Technical field
The present disclosure belongs to the field of information security and relates, for example, to a method and system for identifying harmful pictures.
Background
In the information society, information flows are everywhere, including but not limited to text, video, audio, and pictures. Compared with video, picture files carry a certain amount of visual information while requiring relatively little storage space and bandwidth. With the spread of the mobile Internet, the network is flooded with harmful picture content, such as pictures involving illegal content like drugs, pornography, and violence, or pictures that induce people to join cults, suicide groups, or criminal groups. Because pictures are visually direct and have strong impact, they are more harmful than harmful text or harmful audio. It is therefore necessary to identify such harmful pictures so that they can be filtered or deleted and the harm eliminated.
Current techniques for identifying harmful pictures on the network fall into two broad categories. One is the traditional approach, which mainly relies on various classifiers. The other is deep learning, in particular the application of convolutional neural networks. However, both categories fall short in recognition efficiency.
With the development of big data and artificial intelligence, how to identify harmful pictures efficiently has become a problem that needs to be considered.
Summary of the invention
The present disclosure provides a method for identifying harmful pictures based on URL paths of similar pictures, comprising:
Step a): when it is determined that a page element of a webpage includes a URL path of a picture, identifying the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identifying the user ID recorded in the page content of the webpage; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the result of the user IP address query and/or the ID query;
Step b): obtaining, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; performing a whois query in a second database based on the domain name contained in the URL, and/or querying the second database, based on the IP address to which the URL points, for that IP address or an IP address in the same network segment; and outputting a second weighting factor according to the result of the whois query and/or the IP address query;
Step c): inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all similar pictures of the picture and obtaining the URL paths of all the similar pictures, and obtaining, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; performing whois queries in the second database based on the domain names contained in the URLs of all the similar pictures, and/or querying the second database, based on the IP addresses to which those URLs point, for those IP addresses or IP addresses in the same network segment; and outputting a third weighting factor according to the results of the whois queries and/or the IP address queries;
Step d): combining the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
In addition, the present disclosure discloses a system for identifying harmful pictures based on URL paths of similar pictures, comprising:
a first weighting factor generating module, configured to: when it is determined that a page element of a webpage includes a URL path of a picture, identify the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identify the user ID recorded in the page content of the webpage; query a first database for the IP address or an IP address in the same network segment, and/or query the first database for the ID; and output a first weighting factor according to the result of the user IP address query and/or the ID query;
a second weighting factor generating module, configured to: obtain, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; perform a whois query in a second database based on the domain name contained in the URL, and/or query the second database, based on the IP address to which the URL points, for that IP address or an IP address in the same network segment; and output a second weighting factor according to the result of the whois query and/or the IP address query;
a third weighting factor generating module, configured to: input the URL path of the picture into a third-party picture database, search the third-party picture database for all similar pictures of the picture and obtain the URL paths of all the similar pictures, and obtain, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; perform whois queries in the second database based on the domain names contained in the URLs of all the similar pictures, and/or query the second database, based on the IP addresses to which those URLs point, for those IP addresses or IP addresses in the same network segment; and output a third weighting factor according to the results of the whois queries and/or the IP address queries;
an identifying module, configured to combine the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
With the method and system, the present disclosure can draw on databases built from big data and provide a comparatively efficient scheme for identifying harmful pictures with little image processing.
Brief description of the drawings
Figure 1 is a schematic diagram of the method according to one embodiment of the present disclosure;
Figure 2 is a schematic diagram of the system according to one embodiment of the present disclosure.
Detailed description
To help those skilled in the art understand the technical solutions disclosed herein, the technical solutions of the embodiments are described below with reference to the embodiments and the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present disclosure. The terms "first", "second", and the like are used herein to distinguish different objects, not to describe a particular order. Moreover, "including" and "having", and any variants of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, system, product, or device.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present disclosure. Appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor do they refer to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art will understand that the embodiments described herein can be combined with other embodiments.
Referring to Figure 1, Figure 1 is a schematic flowchart of a method for identifying harmful pictures based on URL paths of similar pictures according to one embodiment of the present disclosure. As shown in the figure, the method comprises:
Step S100: when it is determined that a page element of a webpage includes a URL path of a picture, identifying the IP address or IP address segment of the user recorded in the page content of the webpage, and/or identifying the user ID recorded in the page content of the webpage; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the result of the user IP address query and/or the ID query;
It can be understood that the first database maintains the IP addresses or IP address segments of users known to have posted harmful pictures, as well as a list of the IDs of users who have posted harmful pictures.
This is because harmful pictures generally attract a base of sticky users. Some of these users take part in spreading harmful pictures, and for most of them the IP address and ID are relatively fixed; a considerable number even use the same ID across different websites or forums.
For example, when the user IP address recorded in the page content of the webpage is identified as 192.168.10.3:
if the first database records this IP address, the first weighting factor may be, for example, 1.0;
if the database records only 192.168.10.4, then 192.168.10.3 is moderately suspected of being an alternate address or a recently changed address of a user who has posted harmful pictures, and the first weighting factor may be, for example, 0.6;
if the database records 192.168.10.4 and 192.168.10.5, or even every IP address in the 192.168.10.X segment, then 192.168.10.3 is highly suspected of being an alternate address or a recently changed address of a user who has posted harmful pictures, and the first weighting factor may be, for example, 0.9;
if the database records several 192.168.X.X segments but not the 192.168.10.X segment, then 192.168.10.3 is cautiously suspected of being the address of a user who has posted harmful pictures, and the first weighting factor may be, for example, 0.4.
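The IP lookup cases above can be sketched in Python. This is only an illustration under stated assumptions: the function name `ip_query_factor`, the use of /24 and /16 prefixes to stand for the "192.168.10.X" and "192.168.X.X" segments, the IPv4-only scope, and the fallback value 0.0 for an address with no match are not taken from the disclosure. The returned values 1.0, 0.9, 0.6, and 0.4 are the example values given in the text.

```python
import ipaddress


def ip_query_factor(ip, known_ips):
    """Return an illustrative IP query factor for an IPv4 address.

    known_ips stands in for the addresses recorded in the first database.
    """
    addr = ipaddress.ip_address(ip)
    known = {ipaddress.ip_address(k) for k in known_ips}

    # Exact match: the address itself is recorded.
    if addr in known:
        return 1.0

    # Same /24 segment (assumed reading of "192.168.10.X").
    net24 = ipaddress.ip_network(f"{ip}/24", strict=False)
    same_segment = [k for k in known if k in net24]
    if len(same_segment) >= 2:
        return 0.9  # several neighbours recorded: highly suspected
    if len(same_segment) == 1:
        return 0.6  # a single neighbour recorded: moderately suspected

    # Several other /24 segments inside the same /16 (assumed "192.168.X.X").
    net16 = ipaddress.ip_network(f"{ip}/16", strict=False)
    segments = {
        ipaddress.ip_network(f"{k}/24", strict=False)
        for k in known
        if k in net16
    }
    if len(segments) >= 2:
        return 0.4  # cautiously suspected

    # Hypothetical default: no relation to any recorded address.
    return 0.0
```

With these assumptions, `ip_query_factor("192.168.10.3", ["192.168.10.4"])` yields the 0.6 case described above.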
As another example, when the identified user ID is "tudou":
if the first database records a user ID named "tudou", the first weighting factor may be, for example, 1.0;
if the database records IDs such as "tudou1", "tudou2", "tudou*", or other similar IDs, then "tudou" is mildly suspected of being an alternate ID of the same user, and the first weighting factor may be, for example, 0.3;
if the database records neither "tudou" nor any similar ID, the first weighting factor may be, for example, 0.
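The user ID cases can be sketched in the same way. The disclosure does not say how "similar" IDs are detected, so the sketch below assumes a string-similarity test based on `difflib.SequenceMatcher` from the Python standard library with a hypothetical threshold of 0.8; the function name `id_query_factor` is likewise illustrative. The values 1.0, 0.3, and 0 come from the text.

```python
from difflib import SequenceMatcher


def id_query_factor(user_id, known_ids, threshold=0.8):
    """Return an illustrative ID query factor.

    known_ids stands in for the user IDs recorded in the first database.
    The similarity measure and threshold are assumptions of this sketch.
    """
    # Exact match: the ID itself is recorded.
    if user_id in known_ids:
        return 1.0

    # Near match: some recorded ID is textually close to the queried ID.
    for known in known_ids:
        if SequenceMatcher(None, user_id, known).ratio() >= threshold:
            return 0.3

    # Neither the ID nor a similar one is recorded.
    return 0.0
```

A production system would probably need a more careful notion of similarity (for example, handling wildcard entries such as "tudou*" explicitly), which this sketch does not attempt.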
In particular, the above step may also query the user IP and ID jointly, that is, examine the IP and ID of the user who posted or discussed the picture and check whether they belong to an IP (or IP address segment) and/or ID already present in the first database.
Let the user's IP query factor be u, the ID query factor be v, and the first weighting factor be x, where 0≤u≤1, 0≤v≤1, and 0≤x≤1. The first weighting factor can be determined by the following formula:
x=d×u+e×v, where d+e=1, and d and e are the weights of the user IP query factor and the ID query factor, respectively.
For example, d=e=1/2;
As a further example, d and e are unequal and can be adjusted according to the importance of each query factor and the actual circumstances of determining the first weighting factor.
It can be understood that the closer x is to 1, the heavier the first weighting factor, and the more likely it is that a user recorded in the database was involved with the picture.
The above formula for x is linear; in practice, a nonlinear formula may also be used.
Furthermore, whether the formula is linear or nonlinear, the formula and its parameters may be determined by training or fitting.
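As a minimal sketch of the linear formula x=d×u+e×v, the Python function below combines the two query factors. The function name and the input validation are assumptions added for illustration; the default d=e=1/2 is the example given in the text.

```python
def first_weighting_factor(u, v, d=0.5, e=0.5):
    """Combine the IP query factor u and the ID query factor v linearly."""
    if not (0.0 <= u <= 1.0 and 0.0 <= v <= 1.0):
        raise ValueError("u and v must lie in [0, 1]")
    if abs(d + e - 1.0) > 1e-9:
        raise ValueError("d + e must equal 1")
    # Because d + e = 1 and both inputs lie in [0, 1], x also lies in [0, 1].
    return d * u + e * v
```

For instance, an IP query factor of 0.6 and an ID query factor of 0.3 with equal weights give x = 0.5×0.6 + 0.5×0.3 = 0.45.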
Step S200: obtaining, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; performing a whois query in a second database based on the domain name contained in the URL, and/or querying the second database, based on the IP address to which the URL points, for that IP address or an IP address in the same network segment; and outputting a second weighting factor according to the result of the whois query and/or the IP address query;
It can be understood that the second database maintains a list of domain names known to have published harmful pictures and/or a list of IP addresses and IP address segments of websites known to have published harmful pictures. In contrast to the preceding step, publication here refers to the websites, identified by IP and/or domain name, on which the pictures were published.
The whois query is used to examine the association between the domain name registrant and harmful pictures. The second database may maintain the following information: domain names, information on registrants of domains that publish large numbers of pornographic, reactionary, or cult pictures on the Internet, and identifiers of the corresponding harmful pictures.
For example, when the domain name is www.a.com:
if the second database records the domain name, the identifiers of the corresponding harmful pictures, and its whois information, the second weighting factor may be, for example, 1.0;
if the second database records no harmful-picture identifier for the domain www.a.com, but the registrant of the domain can be found, together with the domain names of other websites registered by the same registrant, and the second database contains identifiers showing that those other websites publish large numbers of harmful pictures on the Internet, then the website at www.a.com is still highly suspected of being a source of harmful pictures even though the second database records no harmful-picture identifier for it, and the second weighting factor may be, for example, 0.9;
if the second database records no harmful-picture identifier for the domain www.a.com, and the registrant of the domain and the domain names of other websites registered by the same registrant can be found, but the second database contains no identifier of harmful pictures published by those other websites, the second weighting factor may be, for example, 0;
it is easy to see that if the second database records no harmful-picture identifier for www.a.com and no domain names of other websites registered by the domain's registrant can be found, the second weighting factor may likewise be, for example, 0.
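The whois cases above amount to a lookup by domain followed by a lookup by registrant. The sketch below models the second database with plain in-memory structures so that it runs without network access; the parameter names (`flagged_domains`, `registrant_of`, `domains_of`) and the function name are hypothetical, and a real deployment would populate them from actual whois records. The values 1.0, 0.9, and 0 are the example values in the text.

```python
def whois_query_factor(domain, flagged_domains, registrant_of, domains_of):
    """Return an illustrative whois query factor for a domain.

    flagged_domains: set of domains recorded with harmful-picture identifiers.
    registrant_of:   dict mapping a domain to its registrant (whois result).
    domains_of:      dict mapping a registrant to the set of domains it registered.
    """
    # The domain itself is recorded as a source of harmful pictures.
    if domain in flagged_domains:
        return 1.0

    # No registrant information available: nothing links the domain to harm.
    registrant = registrant_of.get(domain)
    if registrant is None:
        return 0.0

    # Other domains of the same registrant that are recorded as harmful.
    siblings = set(domains_of.get(registrant, ())) - {domain}
    if siblings & set(flagged_domains):
        return 0.9

    return 0.0
```

Keeping the whois data as arguments rather than querying a live service is a design choice of this sketch; it keeps the scoring logic separate from however the registrant data is obtained.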
As a further example, the IP address to which the URL points may be obtained from the URL path of the picture, and an IP address/IP address segment query may be performed to output the second weighting factor.
For example, when the IP address is 192.168.20.3:
if the second database records this IP address, the second weighting factor may be, for example, 1.0;
if the second database records only 192.168.20.4, then 192.168.20.3 is mildly suspected of being an alternate address or a recently changed address of the website hosting the picture, and the second weighting factor may be, for example, 0.6;
if the second database records 192.168.20.4 and 192.168.20.5, or even every IP address in the 192.168.20.X segment, then 192.168.20.3 is highly suspected of being an alternate address or a recently changed address of the website hosting the picture, and the second weighting factor may be, for example, 0.9;
if the database records several 192.168.X.X segments but not the 192.168.20.X segment, then 192.168.20.3 is cautiously suspected of being the address of a website hosting harmful pictures, and the second weighting factor may be, for example, 0.4.
In particular, the above step may also consider the IP list and the domain name list jointly, that is, determine the second weighting factor from both the IP query for the picture URL and the domain name whois query.
Let the IP query factor of the picture URL be i, the domain name whois query factor be j, and the second weighting factor be y, where 0≤i≤1, 0≤j≤1, and 0≤y≤1. The second weighting factor can be determined by the following formula:
y=m×i+n×j, where m+n=1, and m and n are the weights of the IP query factor and the domain name whois query factor, respectively.
For example, m=n=1/2;
As a further example, m and n are unequal and can be adjusted according to the importance of each query factor and the actual circumstances of determining the second weighting factor.
It can be understood that the closer y is to 1, the heavier the second weighting factor, and the more likely it is that the picture is a harmful picture.
The above formula for y is linear; in practice, a nonlinear formula may also be used.
Furthermore, whether the formula is linear or nonlinear, the formula and its parameters may be determined by training or fitting.
Step S300: inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all similar pictures of the picture and obtaining the URL paths of all the similar pictures, and obtaining, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; performing whois queries in the second database based on the domain names contained in the URLs of all the similar pictures, and/or querying the second database, based on the IP addresses to which those URLs point, for those IP addresses or IP addresses in the same network segment; and outputting a third weighting factor according to the results of the whois queries and/or the IP address queries;
In step S300, a reverse image search for the target picture is performed in the third-party picture database, and the third weighting factor is output according to IP and/or domain name whois queries on the URL paths of the similar pictures in the search results. The third weighting factor is determined from how the IPs and/or domain names of the similar pictures' URL paths fare in the second database, for example by counting how many times their IPs or the whois information of their domain names appear in the second database. It can be understood that when the number of occurrences satisfies the corresponding threshold condition, the third weighting factor may be 1.0, or it may be 0.8 or 0.4, depending on the specific threshold condition.
It should also be emphasized that step S300 still involves little image processing or image recognition. The image processing is performed by the third-party picture database, and the present disclosure itself need not involve much image processing. Take a third-party picture database such as www.tineye.com as an example. Suppose the picture is indeed pornographic, many similar pictures are found in such a database, and the domain names and/or IPs in the URLs of the similar pictures are recorded in the second database. Then, even without performing any picture recognition on the picture or the similar pictures, step S300 can still output a third weighting factor, which may be 1.0 or may be 0.6. Clearly, if the domain names and/or IPs in the URLs of all retrieved similar pictures are recorded in the second database, the third weighting factor is very likely to be 1.0. In other words, step S300 amounts to scoring the domain names and/or IPs corresponding to the URLs of the similar pictures and judging whether they have a prior record; if a considerable number of them do, there is reason to strongly suspect that the picture is a harmful picture.
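The scoring described for step S300 can be sketched as counting how many similar-picture URLs point to hosts already recorded in the second database. The sketch assumes the list of similar-picture URLs has already been returned by the third-party picture database; it does not call any real search API. The function name, the `flagged_hosts` set, and the threshold conditions (all hosts recorded, at least half, at least one) are hypothetical; only the output values 1.0, 0.8, and 0.4 are taken from the text.

```python
from urllib.parse import urlparse


def third_weighting_factor(similar_urls, flagged_hosts):
    """Score a picture from the hosts of its similar pictures' URLs.

    similar_urls:  URL paths of similar pictures (assumed already retrieved).
    flagged_hosts: domains and/or IPs recorded in the second database.
    """
    # Extract the host (domain name or literal IP) from each URL.
    hosts = [urlparse(url).hostname for url in similar_urls]
    hosts = [h for h in hosts if h]
    if not hosts:
        return 0.0

    hits = sum(1 for h in hosts if h in flagged_hosts)

    # Hypothetical threshold conditions on the share of recorded hosts.
    if hits == len(hosts):
        return 1.0
    if hits * 2 >= len(hosts):
        return 0.8
    if hits > 0:
        return 0.4
    return 0.0
```

Note that no image data is touched here: the function operates purely on URLs, which is the efficiency argument the step relies on.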
However, step S300 does not exclude existing techniques for recognizing harmful information in pictures. That is, step S300 may also be combined with traditional image processing methods or with deep learning models in order to identify harmful pictures. In addition, the present disclosure places no restriction on whether the third-party picture database retrieves similar pictures based on content or by other means.
Step S400: combining the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
As an example, let the first weighting factor be x, the second weighting factor be y, and the third weighting factor be z, where 0≤x≤1, 0≤y≤1, and 0≤z≤1. The harm coefficient W of the picture can be calculated from these weighting factors by the following formula:
W=a×x+b×y+c×z, where a+b+c=1, and a, b, and c are the weights of the respective weighting factors.
For example, a=b=c=1/3;
As a further example, a, b, and c are unequal and can be adjusted according to each weighting factor and the actual circumstances of identifying harmful content.
It can be understood that the closer W is to 1, the more likely it is that the picture is a harmful picture.
The above formula for W is linear; in practice, a nonlinear formula may also be used.
Furthermore, whether the formula is linear or nonlinear, the formula and its parameters may be determined by training or fitting.
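A minimal sketch of step S400 follows. The linear formula W=a×x+b×y+c×z and the default a=b=c=1/3 are from the text. The decision function `is_harmful` and its threshold of 0.5 are assumptions of this sketch: the disclosure only states that a W closer to 1 means a higher probability, and does not fix a cut-off.

```python
def harm_coefficient(x, y, z, a=1 / 3, b=1 / 3, c=1 / 3):
    """Combine the three weighting factors into the harm coefficient W."""
    for value in (x, y, z):
        if not 0.0 <= value <= 1.0:
            raise ValueError("each weighting factor must lie in [0, 1]")
    if abs(a + b + c - 1.0) > 1e-9:
        raise ValueError("a + b + c must equal 1")
    return a * x + b * y + c * z


def is_harmful(w, threshold=0.5):
    """Hypothetical decision rule: flag the picture when W reaches the threshold."""
    return w >= threshold
```

For example, x=1.0, y=0.9, and z=0.8 with equal weights give W=0.9, which this sketch would flag as harmful under the assumed threshold.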
In summary, in the above embodiment essentially none of the steps involves specific image processing. Instead, the method takes a different route, relying mainly on queries to obtain the relevant weighting factors. Step S400 then combines (or fuses) the weighting factors to identify harmful pictures. Those skilled in the art know that image processing and recognition are relatively time-consuming, whereas queries are comparatively cheap. The above embodiment therefore provides an efficient method for identifying harmful pictures. In addition, the above embodiment can be further combined with big data and/or artificial intelligence to build and update the first database, the second database, and other databases.
In another embodiment, the second database is a third-party database.
Examples include the many websites that provide whois queries, third-party maintained databases listing pornographic, violent, reactionary, or cult websites, and databases listing the IP addresses and IP address segments of websites recorded as hosting harmful pictures.
In another embodiment, for a web address (for example, a forum or web page) identified as containing a harmful picture, the IP address information of the poster of the harmful picture recorded at that address is collected and the first database is updated. This is because harmful pictures generally attract sticky users, some of whom take part in spreading harmful pictures and most of whom have relatively fixed IP addresses. If the web address itself records the IP address information of the poster of the harmful picture, the present disclosure collects that information to update the first database.
In another embodiment, step S200 further comprises:
querying a third-party domain name security list for the security of the domain name so as to output a security factor, and correcting the domain-related second weighting factor by means of the security factor.
One example is virustotal.com, a third-party domain name security screening website. It can be understood that if third-party information indicates that the domain contains viruses or Trojans, the second weighting factor should be increased, because the website is all the more unsafe.
It can be understood that this embodiment focuses on correcting the second weighting factor from a network security perspective, so as to protect users from further losses. Network security bears on users' privacy and property rights: if a website associated with harmful pictures also poses security risks, then beyond the harm of the pictures themselves it exposes users to privacy leaks or property loss.
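The disclosure does not give a formula for the correction, so the following is only one possible form, stated as an assumption: the security factor s is taken to lie in [0, 1], with 0 meaning no known threat and 1 meaning a confirmed threat, and the second weighting factor y is moved toward 1 in proportion to s. The function name and the formula y + s×(1−y) are hypothetical; the only property taken from the text is that a reported virus or Trojan should increase the second weighting factor.

```python
def corrected_second_factor(y, security_factor):
    """Raise the second weighting factor y according to a security factor.

    security_factor is assumed to lie in [0, 1]; 0 leaves y unchanged and
    1 pushes y to 1. This form keeps the result inside [0, 1].
    """
    if not (0.0 <= y <= 1.0 and 0.0 <= security_factor <= 1.0):
        raise ValueError("y and security_factor must lie in [0, 1]")
    return y + security_factor * (1.0 - y)
```

Other monotone corrections, such as adding a fixed increment and clipping at 1, would satisfy the same requirement; the choice here is only for illustration.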
In another embodiment, step S400 further comprises: when the picture is identified as harmful, submitting the picture to the third-party picture database. This allows the third-party picture database to consider whether to update its data.
In another embodiment, step S300 further comprises:
Step c1): crawling the webpage for audio;
Step c2): identifying whether the audio contains harmful content and, if so, correcting the third weighting factor.
In this embodiment, if the audio is found to contain pornographic content, violent content, reactionary political speech, inflammatory cult speech, or extremist speech promoting terror or hatred, this indicates that the website is threatening, and the third weighting factor is corrected, for example increased.
As noted above, when combined with big data techniques, the present disclosure can effectively combine multiple dimensions and multiple modalities, drawing on IP information, domain name information, image information, and audio information to identify harmful pictures quickly.
Furthermore, the above embodiments may be implemented on the router side or on the network provider side to filter the pictures in advance.
Corresponding to the method, and referring to FIG. 2, the present disclosure discloses in another embodiment a system for identifying harmful pictures based on the URL paths of similar pictures, including:
a first weighting factor generating module, configured to: when it is determined that the page elements of a web page include the URL path of a picture, identify the user's IP address or IP address segment recorded in the page content of the web page, and/or identify the user ID recorded in the page content of the web page; query a first database for the IP address or an IP address in the same network segment, and/or query the first database for the ID; and output a first weighting factor according to the IP address query result and/or the ID query result;
a second weighting factor generating module, configured to: obtain, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; perform a whois query in a second database based on the domain name contained in the URL, and/or query the second database, based on the IP address to which the URL points, for the IP address contained in the URL or an IP address in the same network segment; and output a second weighting factor according to the whois query result and/or the IP address query result;
a third weighting factor generating module, configured to: input the URL path of the picture into a third-party picture database, search the third-party picture database for all similar pictures of the picture, and obtain the URL paths of all the similar pictures; obtain, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; perform a whois query in the second database based on the domain names contained in the URLs of the similar pictures, and/or query the second database, based on the IP addresses to which those URLs point, for the IP addresses contained in the URLs or IP addresses in the same network segment; and output a third weighting factor according to the whois query results and/or the IP address query results;
an identification module, configured to combine the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
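The identification module can be illustrated with a short sketch. The text here does not state how the three weighting factors are combined, so the plain sum and the threshold of 1.0 below are assumptions chosen only for illustration, as is the function name `identify_harmful`.

```python
def identify_harmful(first_weight: float,
                     second_weight: float,
                     third_weight: float,
                     threshold: float = 1.0) -> bool:
    """Combine the three weighting factors and compare the result to a threshold.

    Assumption: the combination is a plain sum and the picture is treated as
    harmful when the sum reaches the threshold.
    """
    score = first_weight + second_weight + third_weight
    return score >= threshold
```

A weighted sum or a learned classifier over the three factors would fit the same module boundary; only the body of the function would change.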
Similarly to the method embodiments described above:
Preferably, the second database is a third-party database.
More preferably, the second weighting factor generating module further includes:
a correction unit, configured to: query the security of the domain name in a third-party domain name security list so as to output a security factor, and correct the domain-related second weighting factor by means of the security factor.
More preferably, the identification module is further configured to: when the picture is identified as harmful, submit the picture to the third-party picture database.
More preferably, the third weighting factor generating module further corrects the third weighting factor by means of the following units:
an audio crawling unit, configured to crawl audio in the web page;
an audio recognition unit, configured to identify whether the audio includes harmful content and, if so, to correct the third weighting factor.
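As a rough illustration of what an audio crawling unit might do, the sketch below collects audio URLs from a page's HTML using only the Python standard library. It assumes that audio is embedded through `<audio>` elements, optionally with nested `<source>` children; the class and function names are hypothetical, and fetching the page and downloading the audio are left out.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AudioLinkParser(HTMLParser):
    """Collect the src URLs of <audio> elements and their <source> children."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.audio_urls = []
        self._in_audio = False

    def handle_starttag(self, tag, attrs):
        src = dict(attrs).get("src")
        if tag == "audio":
            self._in_audio = True
        elif not (tag == "source" and self._in_audio):
            return  # ignore <source> inside <video>/<picture> and all other tags
        if src:
            # Resolve relative paths against the page URL.
            self.audio_urls.append(urljoin(self.base_url, src))

    def handle_endtag(self, tag):
        if tag == "audio":
            self._in_audio = False


def find_audio_urls(html: str, base_url: str) -> list[str]:
    parser = AudioLinkParser(base_url)
    parser.feed(html)
    return parser.audio_urls
```

The returned URLs would then be handed to the audio recognition unit, whose classifier is not specified in the disclosure.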
In another embodiment, the present disclosure discloses a system for identifying harmful pictures, including:
a processor and a memory, the memory storing executable instructions, the processor executing the instructions to perform the following operations:
Step a): when it is determined that the page elements of a web page include the URL path of a picture, identifying the user's IP address or IP address segment recorded in the page content of the web page, and/or identifying the user ID recorded in the page content of the web page; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the IP address query result and/or the ID query result;
Step b): obtaining, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; performing a whois query in a second database based on the domain name contained in the URL, and/or querying the second database, based on the IP address to which the URL points, for the IP address contained in the URL or an IP address in the same network segment; and outputting a second weighting factor according to the whois query result and/or the IP address query result;
Step c): inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all similar pictures of the picture, and obtaining the URL paths of all the similar pictures; obtaining, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; performing a whois query in the second database based on the domain names contained in the URLs of the similar pictures, and/or querying the second database, based on the IP addresses to which those URLs point, for the IP addresses contained in the URLs or IP addresses in the same network segment; and outputting a third weighting factor according to the whois query results and/or the IP address query results;
Step d): combining the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
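The lookup in step a), an exact IP address match or a match within the same network segment, can be sketched with Python's `ipaddress` module. The sketch makes several assumptions that the disclosure does not state: the first database is modeled as a plain collection of IPv4 address strings, a "network segment" is taken to be a /24 prefix, and the returned values 1.0, 0.5, and 0.0 are placeholder weights.

```python
import ipaddress
from typing import Iterable


def first_weight(user_ip: str,
                 first_database: Iterable[str],
                 prefix_len: int = 24) -> float:
    """Return a placeholder first weighting factor for a user IP address.

    1.0 for an exact match in the database, 0.5 if a database entry lies in
    the same network segment (assumed to be a /24 prefix), 0.0 otherwise.
    """
    addr = ipaddress.ip_address(user_ip)
    known = [ipaddress.ip_address(ip) for ip in first_database]
    if addr in known:
        return 1.0
    # strict=False lets us build the enclosing network from a host address.
    segment = ipaddress.ip_network(f"{addr}/{prefix_len}", strict=False)
    if any(k in segment for k in known):
        return 0.5
    return 0.0
```

The user ID lookup described in the same step would be a second, independent query whose result could be merged into the same factor.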
In another embodiment, the present disclosure further discloses a computer storage medium storing executable instructions for performing the following method for identifying harmful pictures:
Step a): when it is determined that the page elements of a web page include the URL path of a picture, identifying the user's IP address or IP address segment recorded in the page content of the web page, and/or identifying the user ID recorded in the page content of the web page; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the IP address query result and/or the ID query result;
Step b): obtaining, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; performing a whois query in a second database based on the domain name contained in the URL, and/or querying the second database, based on the IP address to which the URL points, for the IP address contained in the URL or an IP address in the same network segment; and outputting a second weighting factor according to the whois query result and/or the IP address query result;
Step c): inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all similar pictures of the picture, and obtaining the URL paths of all the similar pictures; obtaining, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; performing a whois query in the second database based on the domain names contained in the URLs of the similar pictures, and/or querying the second database, based on the IP addresses to which those URLs point, for the IP addresses contained in the URLs or IP addresses in the same network segment; and outputting a third weighting factor according to the whois query results and/or the IP address query results;
Step d): combining the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
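Steps b) and c) both begin by pulling a domain name or an IP address out of a URL path. A minimal sketch of that extraction, using only the standard library, is given below. The function name `split_hosts` is hypothetical, and the sketch only recognizes IP addresses written literally in the URL; resolving a domain name to the IP address it points to, and the whois and database queries that follow, are not shown.

```python
import ipaddress
from urllib.parse import urlparse


def split_hosts(urls: list[str]) -> tuple[set[str], set[str]]:
    """Split the hosts of the given URLs into domain names and literal IPs."""
    domains, ips = set(), set()
    for url in urls:
        host = urlparse(url).hostname  # lowercased host, or None if absent
        if host is None:
            continue
        try:
            ips.add(str(ipaddress.ip_address(host)))
        except ValueError:
            # Not an IP literal, so treat the host as a domain name.
            domains.add(host)
    return domains, ips
```

For step b) the list would hold the single URL of the picture; for step c) it would hold the URLs of all similar pictures returned by the third-party picture database.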
The above system may include: at least one processor (e.g., a CPU), at least one sensor (e.g., an accelerometer, a gyroscope, a GPS module, or another positioning module), at least one memory, and at least one communication bus, where the communication bus provides the connections and communication between the components. The device may further include at least one receiver and at least one transmitter, each of which may be a wired port or a wireless device (e.g., one including an antenna arrangement), for exchanging signaling or data with other node devices. The memory may be high-speed RAM or non-volatile memory, such as at least one disk memory. Optionally, the memory may be at least one storage device located remotely from the processor. The memory stores a set of program code, and the processor can call the code stored in the memory via the communication bus to perform the relevant functions.
An embodiment of the present disclosure further provides a computer storage medium that can store a program which, when executed, performs some or all of the steps of any of the methods for identifying harmful pictures described in the foregoing method embodiments.
The steps in the methods of the embodiments of the present disclosure may be reordered, merged, and deleted according to actual needs.
The modules and units in the systems of the embodiments of the present disclosure may be merged, divided, and deleted according to actual needs. It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions; however, those skilled in the art will appreciate that the present invention is not limited by the described order of actions, because according to the present invention certain steps may be performed in another order or simultaneously. Furthermore, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the actions, modules, and units involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed in one embodiment, refer to the related descriptions of the other embodiments.
In the several embodiments provided by the present disclosure, it should be understood that the disclosed system may be implemented in other ways. For example, the embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection between units or components may be an indirect coupling or communication connection through interfaces, devices, or units, and may be electrical or take another form.
The units described as separate components may or may not be physically separate; they may be located in one place or distributed across multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist separately, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as a standalone product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present disclosure, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a smartphone, a personal digital assistant, a wearable device, a laptop, or a tablet) to perform all or part of the steps of the methods described in the embodiments of the present disclosure. The storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above embodiments are intended only to illustrate the technical solutions of the present disclosure, not to limit them. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in those embodiments may still be modified, or some of their technical features replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present disclosure.

Claims (12)

  1. A method for identifying harmful pictures based on the URL paths of similar pictures, comprising:
    Step a): when it is determined that the page elements of a web page include the URL path of a picture, identifying the user's IP address or IP address segment recorded in the page content of the web page, and/or identifying the user ID recorded in the page content of the web page; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the IP address query result and/or the ID query result;
    Step b): obtaining, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; performing a whois query in a second database based on the domain name contained in the URL, and/or querying the second database, based on the IP address to which the URL points, for the IP address contained in the URL or an IP address in the same network segment; and outputting a second weighting factor according to the whois query result and/or the IP address query result;
    Step c): inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all similar pictures of the picture, and obtaining the URL paths of all the similar pictures; obtaining, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; performing a whois query in the second database based on the domain names contained in the URLs of the similar pictures, and/or querying the second database, based on the IP addresses to which those URLs point, for the IP addresses contained in the URLs or IP addresses in the same network segment; and outputting a third weighting factor according to the whois query results and/or the IP address query results;
    Step d): combining the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
  2. The method according to claim 1, wherein the second database is a third-party database.
  3. The method according to claim 1, wherein step b) further comprises:
    querying the security of the domain name in a third-party domain name security list so as to output a security factor, and correcting the second weighting factor by means of the security factor.
  4. The method according to claim 1, wherein step d) further comprises:
    when the picture is identified as harmful, submitting the picture to the third-party picture database.
  5. The method according to claim 1, wherein step c) further comprises:
    Step c1): crawling audio in the web page;
    Step c2): identifying whether the audio includes harmful content and, if so, correcting the third weighting factor.
  6. A system for identifying harmful pictures based on the URL paths of similar pictures, comprising:
    a first weighting factor generating module, configured to: when it is determined that the page elements of a web page include the URL path of a picture, identify the user's IP address or IP address segment recorded in the page content of the web page, and/or identify the user ID recorded in the page content of the web page; query a first database for the IP address or an IP address in the same network segment, and/or query the first database for the ID; and output a first weighting factor according to the IP address query result and/or the ID query result;
    a second weighting factor generating module, configured to: obtain, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; perform a whois query in a second database based on the domain name contained in the URL, and/or query the second database, based on the IP address to which the URL points, for the IP address contained in the URL or an IP address in the same network segment; and output a second weighting factor according to the whois query result and/or the IP address query result;
    a third weighting factor generating module, configured to: input the URL path of the picture into a third-party picture database, search the third-party picture database for all similar pictures of the picture, and obtain the URL paths of all the similar pictures; obtain, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; perform a whois query in the second database based on the domain names contained in the URLs of the similar pictures, and/or query the second database, based on the IP addresses to which those URLs point, for the IP addresses contained in the URLs or IP addresses in the same network segment; and output a third weighting factor according to the whois query results and/or the IP address query results;
    an identification module, configured to combine the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
  7. The system according to claim 6, wherein, preferably, the second database is a third-party database.
  8. The system according to claim 6, wherein the second weighting factor generating module further comprises:
    a correction unit, configured to: query the security of the domain name in a third-party domain name security list so as to output a security factor, and correct the second weighting factor by means of the security factor.
  9. The system according to claim 6, wherein the identification module is further configured to: when the picture is identified as harmful, submit the picture to the third-party picture database.
  10. The system according to claim 6, wherein the third weighting factor generating module further corrects the third weighting factor by means of the following units:
    an audio crawling unit, configured to crawl audio in the web page;
    an audio recognition unit, configured to identify whether the audio includes harmful content and, if so, to correct the third weighting factor.
  11. A system for identifying harmful pictures, comprising:
    a processor and a memory, the memory storing executable instructions, the processor executing the instructions to perform the following operations:
    Step a): when it is determined that the page elements of a web page include the URL path of a picture, identifying the user's IP address or IP address segment recorded in the page content of the web page, and/or identifying the user ID recorded in the page content of the web page; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the IP address query result and/or the ID query result;
    Step b): obtaining, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; performing a whois query in a second database based on the domain name contained in the URL, and/or querying the second database, based on the IP address to which the URL points, for the IP address contained in the URL or an IP address in the same network segment; and outputting a second weighting factor according to the whois query result and/or the IP address query result;
    Step c): inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all similar pictures of the picture, and obtaining the URL paths of all the similar pictures; obtaining, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; performing a whois query in the second database based on the domain names contained in the URLs of the similar pictures, and/or querying the second database, based on the IP addresses to which those URLs point, for the IP addresses contained in the URLs or IP addresses in the same network segment; and outputting a third weighting factor according to the whois query results and/or the IP address query results;
    Step d): combining the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
  12. A computer storage medium storing executable instructions for performing the following method for identifying harmful pictures:
    Step a): when it is determined that the page elements of a web page include the URL path of a picture, identifying the user's IP address or IP address segment recorded in the page content of the web page, and/or identifying the user ID recorded in the page content of the web page; querying a first database for the IP address or an IP address in the same network segment, and/or querying the first database for the ID; and outputting a first weighting factor according to the IP address query result and/or the ID query result;
    Step b): obtaining, from the URL path of the picture, the domain name contained in the URL and/or the IP address to which the URL points; performing a whois query in a second database based on the domain name contained in the URL, and/or querying the second database, based on the IP address to which the URL points, for the IP address contained in the URL or an IP address in the same network segment; and outputting a second weighting factor according to the whois query result and/or the IP address query result;
    Step c): inputting the URL path of the picture into a third-party picture database, searching the third-party picture database for all similar pictures of the picture, and obtaining the URL paths of all the similar pictures; obtaining, from those URL paths, the domain names contained in the URLs of the similar pictures and/or the IP addresses to which those URLs point; performing a whois query in the second database based on the domain names contained in the URLs of the similar pictures, and/or querying the second database, based on the IP addresses to which those URLs point, for the IP addresses contained in the URLs or IP addresses in the same network segment; and outputting a third weighting factor according to the whois query results and/or the IP address query results;
    Step d): combining the first weighting factor, the second weighting factor, and the third weighting factor to identify whether the picture is a harmful picture.
PCT/CN2018/072242 2017-12-30 2018-01-11 Method and system for identifying malicious image on the basis of url paths of similar images WO2019127658A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711500074.1 2017-12-30
CN201711500074.1A CN110020258A (en) 2017-12-30 2017-12-30 A kind of method and system of the URL Path Recognition nocuousness picture based on approximate diagram

Publications (1)

Publication Number Publication Date
WO2019127658A1 true WO2019127658A1 (en) 2019-07-04

Family

ID=67064964

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/072242 WO2019127658A1 (en) 2017-12-30 2018-01-11 Method and system for identifying malicious image on the basis of url paths of similar images

Country Status (2)

Country Link
CN (1) CN110020258A (en)
WO (1) WO2019127658A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1761206A (en) * 2005-11-18 2006-04-19 郑州金惠计算机系统工程有限公司 Multifunctional management system for detecting erotic images and unhealthy information in network
CN1968408A (en) * 2006-04-30 2007-05-23 华为技术有限公司 Video code stream filtering method and filtering node
CN102880613A (en) * 2011-07-14 2013-01-16 腾讯科技(深圳)有限公司 Identification method of porno pictures and equipment thereof
US20140196144A1 (en) * 2013-01-04 2014-07-10 Jason Aaron Trost Method and Apparatus for Detecting Malicious Websites
CN104615760A (en) * 2015-02-13 2015-05-13 北京瑞星信息技术有限公司 Phishing website recognizing method and phishing website recognizing system
CN106055574A (en) * 2016-05-19 2016-10-26 微梦创科网络科技(中国)有限公司 Method and device for recognizing illegal URL

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100361451C (en) * 2005-11-18 2008-01-09 郑州金惠计算机系统工程有限公司 System for detecting eroticism and unhealthy images on network based on content
US8171107B2 (en) * 2008-03-03 2012-05-01 Kidzui, Inc. Method and apparatus for editing, filtering, ranking, and approving content
CN101605140B (en) * 2009-07-16 2012-10-03 阿里巴巴集团控股有限公司 Network user identity verification and authentication system and verification and authentication method
CN101853377B (en) * 2010-05-13 2012-10-17 复旦大学 Method for identifying content of digital video
CN102332028B (en) * 2011-10-15 2013-08-28 西安交通大学 Webpage-oriented unhealthy Web content identifying method
CN103605808B (en) * 2013-12-10 2016-03-30 合一网络技术(北京)有限公司 Based on the method and system that the UGC of search recommends
CN104992177A (en) * 2015-06-12 2015-10-21 安徽大学 Internet porn image detection method based on deep convolution nerve network
CN106101740B (en) * 2016-07-13 2019-12-24 百度在线网络技术(北京)有限公司 Video content identification method and device
CN106354800A (en) * 2016-08-26 2017-01-25 中国互联网络信息中心 Undesirable website detection method based on multi-dimensional feature

Also Published As

Publication number Publication date
CN110020258A (en) 2019-07-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18894038

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18894038

Country of ref document: EP

Kind code of ref document: A1