CN111859234A - Illegal content identification method and device, electronic equipment and storage medium - Google Patents

Publication number
CN111859234A
Authority
CN
China
Prior art keywords
illegal
access
users
honeypot
website
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010494955.2A
Other languages
Chinese (zh)
Inventor
韩睿
李晓宇
李明
张伟东
张月鹏
王志慧
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Ultrapower Intelligent Data Technology Co ltd
Original Assignee
Beijing Ultrapower Intelligent Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Ultrapower Intelligent Data Technology Co ltd filed Critical Beijing Ultrapower Intelligent Data Technology Co ltd
Priority to CN202010494955.2A priority Critical patent/CN111859234A/en
Publication of CN111859234A publication Critical patent/CN111859234A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 — Details of database functions independent of the retrieved data types
    • G06F16/95 — Retrieval from the web
    • G06F16/958 — Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/951 — Indexing; Web crawling techniques
    • G06F16/955 — Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 — URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses an illegal content identification method and apparatus, an electronic device, and a storage medium. The illegal content identification method comprises the following steps: obtaining a target sample user set according to honeypot access records of users accessing honeypots, wherein the honeypots are baits generated from web content of interest to users and used to attract users to visit; acquiring the to-be-analyzed access records of all users in the target sample user set and performing statistical analysis on them to determine potential illegal websites; and sending an access request to each determined potential illegal website and determining, from the response data it returns, whether it contains illegal content. According to the embodiments of the application, illegal content hidden in the deep web is effectively identified through active probing, the direct source or indirect origin of high-risk information is determined, the difficulty of identifying illegal content is reduced, and a basic guarantee is provided for illegal content remediation and website risk prevention and control.

Description

Illegal content identification method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of network technologies, and in particular, to an illegal content identification method and apparatus, an electronic device, and a storage medium.
Background
In the mobile internet era, access traffic gives free websites economic income, so some lawbreakers take risks for the sake of traffic diversion and promotion: they publish illegal content such as online loan advertisements, online gambling platform entrances, and pornographic images. This sensitive, illegal content is often hidden in the deep web (the non-surface web content on the internet that cannot be indexed by standard search engines), cannot be reached by search engines or manual retrieval, is difficult to identify, and remediation and risk control over it are therefore ineffective.
Disclosure of Invention
In view of the above, the present application is made to provide an illegal content identification method, apparatus, electronic device and storage medium that overcome the above problems or at least partially solve the above problems.
According to an aspect of the present application, there is provided an illegal content identification method, including:
obtaining a target sample user set according to honeypot access records of users accessing honeypots, wherein the honeypots are baits generated from web content of interest to users and used to attract them to visit;
acquiring the to-be-analyzed access records of all users in the target sample user set, and performing statistical analysis on the acquired to-be-analyzed access records to determine potential illegal websites;
and sending an access request to each determined potential illegal website, and determining whether it contains illegal content according to the response data it returns.
According to another aspect of the present application, there is provided an illegal content recognition apparatus including:
a sample unit, configured to obtain a target sample user set according to honeypot access records of users accessing honeypots, wherein a honeypot is a bait generated from web content of interest to users and used to attract them to visit;
a statistical analysis unit, configured to acquire the to-be-analyzed access records of each user in the target sample user set, perform statistical analysis on them, and determine potential illegal websites;
and an identification unit, configured to send an access request to each determined potential illegal website and determine whether it contains illegal content according to the response data it returns.
In accordance with yet another aspect of the present application, there is provided an electronic device including: a processor; and a memory arranged to store computer executable instructions that, when executed, cause the processor to perform the method as described above.
According to a further aspect of the application, there is provided a computer readable storage medium, wherein the computer readable storage medium stores one or more programs which, when executed by a processor, implement a method as in any above.
According to the technical solution above, the target sample user set is obtained from the honeypot access records of users accessing honeypots, where a honeypot is a bait generated from web content of interest to users and used to induce them to visit so that their access records can be obtained. After the target sample user set is obtained, statistical analysis is performed on each user's to-be-analyzed access records to determine potential illegal websites; an access request is then sent to each potential illegal website, and whether it contains illegal content is determined from the response data it returns. In this way, illegal content in the deep web is actively probed and effectively identified, the direct source or indirect origin of the high-risk information is found, identification efficiency is improved, and a basic guarantee is provided for illegal content remediation and website risk prevention and control. In addition, the embodiments of the application do not require a large number of professionals to handle reported complaints, so labor cost is low.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the application. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 shows a schematic flow diagram of an illegal content identification method according to an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of a process for forming a target sample user set according to one embodiment of the present application;
FIG. 3 illustrates a schematic flow chart of determining a potential illegal website according to one embodiment of the present application;
FIG. 4 illustrates a flow diagram for review and confirmation of potential illegal websites according to one embodiment of the present application;
FIG. 5 shows a block diagram of an illegal content identification device according to an embodiment of the present application;
FIG. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 7 shows a schematic structural diagram of a computer-readable storage medium according to an embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present application are shown in the drawings, it should be understood that the present application may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
For sensitive and even illegal content, such as online loan advertisements and pornographic images, published on websites by lawbreakers for traffic diversion and promotion, traditional identification and remediation schemes mainly comprise the following:
1. Report handling. A complaint channel is provided on the website; if users complain about a certain website, it is reviewed in real time and manually blocked.
2. Centralized processing by a risk control department. The risk control department searches for and remediates illegal domain names and illegal sensitive information.
3. Unified crawling by search engines. The search engine filters sensitive words and risky content to prevent ordinary users from finding illegal content through search.
Historical practice with these schemes has revealed problems. Report handling requires manual review, which is extremely costly in human terms, and there is a shortage of professionals. The centralized processing scheme of the risk control department cannot be fully automated; strategies can only be rolled out on a small scale based on experience, so the effect is not ideal. The search-engine scheme can only process illegal content passively: websites the search engine has not crawled cannot be intercepted. In addition, interception and reporting are delayed, so timeliness is poor.
In contrast, the embodiment of the present application provides an illegal content identification scheme based on big data: user devices that frequently access illegal content are aggregated to form a target sample user group. Such users often show consistent access tendencies; for example, people who have visited one illegal content website often visit other websites of the same kind, and people who browse pornographic images online tend to view and click similar content again. Using this characteristic of the sample, the scheme statistically analyzes the samples' access records to actively discover illegal content (which is often hidden in the deep web and cannot be retrieved by search engines or human operators) and process it accordingly. This provides an active illegal content identification scheme, reduces the difficulty and labor cost of identification, and greatly facilitates illegal content regulation and website risk control.
For the sake of understanding, some technical terms of the embodiments of the present application will be explained.
URL: uniform Resource Locator, Uniform Resource Locator. For example, www.baidu.com/q xxx is a URL that uniquely identifies a resource on a network.
Domain name: the name of a computer or group of computers on the internet, consisting of a string of names separated by dot numbers, is used to identify the location (sometimes also referred to as the geographical location) of the computer at the time of data transmission. For the aforementioned URL, www.baidu.com is its domain name.
Access record (or, access log): a user can generate a large amount of access records in the processes of surfing the internet, surfing the internet by using a mobile phone and the like, wherein the records comprise resource information such as URL (uniform resource locator), domain name and the like.
And (4) honeypot: the pre-arranged webpage contents such as titles and pictures with bait properties attract interested visitors to visit, and the content of honeypots is not real illegal content, but is disguised in an indirect way such as a title party, a ball, and the like to attract guest groups. An IP (Internet Protocol) or device accessing the honeypot may be tagged with a "fish" label.
"fishy smell": after any device accesses the honeypot, an access record including the honeypot URL is generated and aggregated into big data. Because of the unique certainty of the URL, this access record with the URL is "fishy". By "fishy" it is possible to quickly identify which devices are visiting honeypots and which are not.
Fig. 1 is a schematic flow chart illustrating an illegal content identification method according to an embodiment of the present application, and referring to fig. 1, the illegal content identification method includes the following steps:
and step S110, obtaining a target sample user set according to honeypot access records of user access honeypots, wherein the honeypots are baits which are generated based on webpage contents interested by the users and are used for attracting the user access.
According to the embodiment of the application, the honeypot is purposefully and preliminarily made according to the content of interest of the user, for example, the honeypot is a picture simulating a network gambling platform, the honeypot is used for attracting the user to visit so that the user is infected with fishy smell, and all the users infected with fishy smell are gathered to obtain the target sample user set.
It should be noted that the target sample user set includes honeypot access records of the sample users, and the honeypot access records include device model information, skip route information, access domain name information, URL information of the users, and IP address information obtained according to the access domain names, and the like.
Step S120: acquire the to-be-analyzed access records of each user in the target sample user set, and perform statistical analysis on them to determine potential illegal websites.
In practical application, the access logs of a large number of users are aggregated to form big data, which is the raw material for filtering out the target information. Big data technology is used to analyze the statistical logs of a large number of users, extract the to-be-analyzed access records of the "fishy" users, and analyze those records to determine potential illegal websites.
As described above, users often show consistent access tendencies: people who visit one illegal content website often visit other websites of the same kind, people who browse pornographic images online view and click similar content again, and so on. Based on this, the embodiment of the present application analyzes all access logs of the "fishy" sample users in order to actively discover websites that may contain illegal content.
Note: the to-be-analyzed access records are the records generated by a user's ordinary web browsing over a period of time (such as one year, half a year, or three months), excluding the honeypot access records. Considering the relationship between the number of to-be-analyzed access records and the probability of discovering a potential illegal website, as many of a user's to-be-analyzed access records as possible should be obtained, for example all the records that can be retrieved for that user.
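As a concrete illustration of this note, the window-based selection of to-be-analyzed records can be sketched as follows. The record shape and field names are assumptions for illustration only, not the patent's actual data model:

```python
from datetime import datetime, timedelta

def select_records_to_analyze(records, honeypot_domains, window_days=90):
    """Keep a user's ordinary browsing records from the last window_days,
    excluding the honeypot hits themselves.
    Assumed record shape: {"domain": str, "ts": datetime}."""
    cutoff = datetime.now() - timedelta(days=window_days)
    return [
        r for r in records
        if r["ts"] >= cutoff and r["domain"] not in honeypot_domains
    ]
```

The honeypot domains are excluded because, as the note says, only ordinary browsing behavior is analyzed in this step.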
Step S130: send an access request to each determined potential illegal website, and determine whether it contains illegal content according to the response data it returns.
To improve identification accuracy, the embodiment of the application sends an access request to each determined potential illegal website and determines, from the response data returned, whether the website really contains illegal content.
As shown in fig. 1, the illegal content identification method of the embodiment aggregates the honeypot access records of users who have accessed honeypots to obtain a target sample user set, acquires and statistically analyzes the to-be-analyzed access records of those users to determine potential illegal websites, and then judges from the returned response data whether a potential illegal website really contains illegal content. In this way, illegal content in the deep web can be effectively identified, the direct source or indirect origin of high-risk information can be determined, and a solid basic guarantee is provided for illegal content governance, website risk prevention and control, and the like.
In specific implementations, application scenarios of the embodiments include, but are not limited to: identifying whether a website contains objectively illegal content related to gambling, fraud, pornography, terrorism, and the like; identifying the crowd that frequently visits the same type of website and building user profiles from crowd characteristics; fighting network crime by identifying and preventing illegal behaviors, such as obscenity and fraud, that exploit the network for publicity and organization; and helping communication platforms (such as WeChat, QQ, and Weibo) intercept illegal external content, reducing risk control costs and improving risk control effectiveness.
In general, the illegal content identification method of the embodiment comprises: step one, obtaining the target sample user group; step two, processing the access records of the target sample user group; and step three, confirmation via an AI (Artificial Intelligence) model or manual identification. These are described below with reference to figs. 2 to 4.
Step one: obtain the target sample user group.
Fig. 2 is a schematic diagram illustrating the process of forming the target sample user set (i.e., the target sample user group) according to an embodiment of the present application.
Obtaining the target sample user set according to honeypot access records of users accessing honeypots comprises: making a web page containing a honeypot and placing it on the network to attract users to visit; when a user visits the web page, executing a JS script preset in the page to access the honeypot's target URL, adding a timestamp to the target URL, and generating a honeypot access record containing the target URL; and aggregating the honeypot access records that access the same kind of target URL to obtain the target sample user set.
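The aggregation step just described, grouping honeypot access records by the target URL they hit into one candidate user set per honeypot type, can be sketched as follows (the record shape and field names are hypothetical):

```python
from collections import defaultdict

def build_target_sample_sets(honeypot_records):
    """Group honeypot access records by target URL, yielding one
    candidate device set per honeypot (i.e., per honeypot type).
    Assumed record shape: {"device_id": str, "target_url": str}."""
    groups = defaultdict(set)
    for rec in honeypot_records:
        groups[rec["target_url"]].add(rec["device_id"])
    return dict(groups)
```

Each resulting set is a candidate target sample user group, still subject to the crawler filtering described below.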
Note: target URLs of the same kind are URLs of honeypots of the same kind, for example honeypots that all simulate pornographic content, or that all simulate fraudulent content, and so on. In other words, this embodiment clusters the users by honeypot type.
With reference to fig. 2, the target sample user group is acquired as follows:
Make a honeypot and deploy it.
Judge whether the user is attracted by the honeypot. If so, the user's visit to the honeypot triggers the embedded JS, the unique URL is accessed, and the user becomes "fishy"; otherwise, the user is not counted into the user set.
For example, a honeypot is first made and placed on the public network with an eye-catching title or picture to attract users to click. It should be emphasized that in the embodiment of the present application, a JS (JavaScript, a lightweight, interpreted or just-in-time compiled programming language) script, also called a JS embedded point, is carried in the honeypot's HTML (HyperText Markup Language) file. The JS script accesses a carefully constructed URL with a timestamp in an inline manner, which then shows up in the user's access record. A honeypot access record looks like:
www.yzt.xyz/?t=155328142&refer=www.baidu.com
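A minimal sketch of how such a timestamped honeypot URL could be constructed, mirroring the `t` and `refer` parameters of the sample record above (the function name and server-side details are assumptions, not the patent's implementation):

```python
import time
from urllib.parse import urlencode

def beacon_url(base, referer):
    # Append a Unix timestamp and the referring page, as in the
    # sample record www.yzt.xyz/?t=...&refer=...
    params = {"t": int(time.time()), "refer": referer}
    return f"{base}/?{urlencode(params)}"
```

In the real scheme the embedded JS on the honeypot page would request such a URL, so it appears in the visitor's access log.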
Note: a user who is not attracted by the honeypot is not interested in it and therefore does not click on it; such users are of no concern to the embodiment of the present application.
Referring to fig. 2, a big data batch process aggregates all "fishy" user devices.
That is, all users who visit a honeypot become "fishy", and all users with the same kind of "fishiness" are classified into the same user group.
Placing the web page on the network to attract users to visit has two specific implementations:
creating a text file robots.txt and declaring in it that the web page may be crawled by search engines, so that the page is mixed into search-engine results; or delivering the web page to the network in the form of online advertisements during preset delivery time periods. In this way the honeypots are deployed legitimately and accidental harm is avoided.
That is, robots.txt can be configured to permit search-engine crawling so that honeypots are mixed into the search engine's query results, attracting interested users to visit. Alternatively, the web page can be delivered as online advertisements during preset time periods, for example placing honeypot advertisements in cheap time slots to attract interested users to click.
The robots protocol, also known as robots.txt, is an ASCII-encoded text file stored in a website's root directory. It typically tells the web spiders of search engines which content on the site should not be fetched and which content may be fetched.
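A minimal robots.txt of the permissive kind described here might look like this (illustrative only; the actual file naturally depends on the deployed site):

```
User-agent: *
Allow: /
```

The wildcard `User-agent` line addresses all spiders, and `Allow: /` declares the whole site crawlable, so the honeypot page can be indexed and surface in search results.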
Since a large share (more than 50%) of internet traffic comes from crawlers, and crawlers would invalidate the characteristics of the target sample users, such invalid samples need to be filtered out in the embodiment of the present application.
Referring to fig. 2, it is determined whether a request is a crawler or a malicious request. If it is not, the user sample is put into the user set to form the target sample user set; if it is, the sample is filtered out and not included in the user set. That is, obtaining the target sample user set according to the honeypot access records of users accessing honeypots further comprises: examining the current honeypot access record, and if the access request corresponding to it is a crawler or a malicious request, treating the record as an invalid sample and deleting it.
It should be noted that determining whether an access request is a crawler or a malicious request is prior art. For example, crawler libraries shared by large internet companies can be checked for the user device's model to decide whether the request is a crawler. Alternatively, the decision can be based on the User Agent (UA), a component of the HTTP (HyperText Transfer Protocol) header fields. The User Agent is an identifier that provides the visited website with information such as the type, operating system and version of the user's browser, its rendering engine, language, and plug-ins. The UA string is sent to the server with every browser HTTP request.
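A simplified sketch of the UA-based check (the token list here is a hypothetical stand-in for the shared crawler libraries mentioned above; real systems match far richer signatures):

```python
# Hypothetical token list; production systems consult shared crawler libraries.
CRAWLER_UA_TOKENS = ("bot", "spider", "crawler", "curl")

def looks_like_crawler(user_agent):
    """Flag a request whose User-Agent string matches a known crawler token."""
    ua = (user_agent or "").lower()
    return any(tok in ua for tok in CRAWLER_UA_TOKENS)
```

Records flagged this way would be dropped as invalid samples before the target sample user set is formed.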
Thus, the target sample user group is obtained.
Step two: process the access records of the target sample user group.
On the basis of the target sample user group obtained in step one, this step processes its access records. Specifically, acquiring the to-be-analyzed access records of each user in the target sample user set, performing statistical analysis on them, and determining potential illegal websites comprises: obtaining the to-be-analyzed access records of all users in the target sample user set with a big data batch processing algorithm; aggregating the obtained records by domain name and counting each domain name's access frequency; sorting the domain names by access frequency from high to low and extracting a preset number of top domain names as high-frequency domain names; and determining the potential illegal websites from those high-frequency domain names.
With the target sample user group determined according to the embodiment of the application, the sample users' to-be-analyzed access records are obtained and potential illegal websites are mined from them, so that illegal content hidden in the deep web can be discovered actively.
Fig. 3 is a schematic diagram illustrating the process of determining a potential illegal website according to an embodiment of the present application. Referring to fig. 3, the process comprises:
Clean the big data and extract the to-be-analyzed access records of each user in the target sample user set.
Statistical logs of massive numbers of users are analyzed with big data batch processing algorithms such as MapReduce and Spark; all access records of the target sample user group (i.e., the target sample user set) are extracted, preliminarily cleaned, and sorted. Since the top-ranked websites are usually addresses of advertisement SDKs (Software Development Kits), search-engine URLs, WeChat servers, and the like, cleaning and filtering greatly increase the probability that the remaining high-frequency websites contain illegal content. Note: an advertisement SDK is an interface service that advertisers provide to developers for embedding advertisements in apps (applications) or websites, giving developers a way to monetize.
Aggregate by domain name, count, and sort.
The to-be-analyzed access records are aggregated according to the domain name information they contain, the access frequency of each domain name is counted, and the domain names are sorted from high to low by access frequency; potential illegal websites can then be obtained from this step.
For example, suppose the total number of to-be-analyzed access records of all users in the target sample user set is 300 (for illustration only): after aggregation by domain name there are 150 records accessing domain name A, 100 accessing domain name B, and 50 accessing domain name C. Sorted from high to low by access frequency, the order is A-B-C. Taking the preset number (for example, 2) of top domain names as high-frequency domain names yields A and B.
Filter out white-list domain names.
The high-frequency domain names are filtered here to remove white-list domain names, because white-list domain names are not regarded as illegal websites.
That is, determining potential illegal websites from the high-frequency domain names comprises: matching each high-frequency domain name against the domain names in a preset white list; if a match succeeds, filtering that high-frequency domain name out; and determining the potential illegal websites from the remaining high-frequency domain names. It should be understood that in practice the white list can be configured as required, adding reliable domain names to it to avoid misjudgment.
Continuing the example above, suppose the white list contains domain name B. Matching the extracted high-frequency domain names A and B against the white list, B matches and is filtered out, leaving domain name A.
Extract the high-frequency domain names and determine the potential illegal websites.
At this point the remaining high-frequency domain names determine the potential illegal websites; that is, the website indicated by domain name A is a potential illegal website.
In practice, the number of potential illegal websites may be huge. To process important ones first, a weighted access-frequency score can be computed from a preset weight and each potential illegal website's access frequency; the higher the score, the higher the priority and the sooner it is processed (e.g., review and confirmation).
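A sketch of this weighted prioritization (the weights and their source are assumptions; the patent does not fix a concrete formula):

```python
def review_order(sites):
    """sites: list of (domain, access_frequency, weight) tuples.
    Returns domains sorted by weighted score, highest first, so the most
    important potential illegal websites are reviewed earliest."""
    return [d for d, freq, w in sorted(sites, key=lambda s: -(s[1] * s[2]))]
```

The weight might, for example, reflect the suspected content category's severity, so a lower-traffic but higher-risk site can jump the queue.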
After the potential illegal websites are determined, the embodiment of the application further verifies and confirms them. That is, determining whether a potential illegal website contains illegal content according to the response data it returns includes: inputting the response data into a deep learning model pre-trained on a historical data set, and, if the deep learning model finds content in the response data whose similarity to known illegal content exceeds a threshold, outputting an identification result that the potential illegal website contains illegal content; or aggregating the response data in batches into one window for manual review to obtain the identification result of whether the potential illegal website contains illegal content.
Specifically, in the manual review process, if the response data is a complete, renderable HTML file, the web pages are aggregated into one window and displayed so that a reviewer can judge whether the HTML file contains illegal content; if the response data is a JSON character string or the byte codes of pictures or videos, it is first converted, then aggregated into one window and displayed for the reviewer to judge whether it contains illegal content.
Fig. 4 is a schematic diagram of a verification process for potential illegal websites according to an embodiment of the present application. Referring to fig. 4, response data is obtained in batches for the potential illegal website result set containing the potential illegal websites.
For example, a crawler plus a forward proxy is used to obtain the HTTP response data of each potential illegal website in order to confirm whether it contains illegal content in the broad sense. In practical applications, a crawler strategy is established, response data is obtained in batches, and the response data returned by the potential illegal websites is stored uniformly as samples in a Hive database, awaiting AI or manual confirmation. (Hive is a data warehouse tool built on Hadoop, the open-source big data platform.)
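The batch-fetching step might look like the following stdlib-only sketch; the proxy address, batch size, and record layout are assumptions for illustration (a production crawler would add retries, rate limiting, and writing the samples to Hive):

```python
import urllib.request

def make_opener(proxy_url=None):
    """Build an opener that routes requests through a forward proxy, if given."""
    handlers = []
    if proxy_url:
        handlers.append(urllib.request.ProxyHandler(
            {"http": proxy_url, "https": proxy_url}))
    return urllib.request.build_opener(*handlers)

def batches(items, size):
    """Split the potential illegal website result set into fixed-size batches."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def fetch_batch(opener, urls, timeout=10):
    """Fetch each URL in a batch; keep the raw response body as a sample."""
    samples = []
    for url in urls:
        try:
            with opener.open(url, timeout=timeout) as resp:
                samples.append({"url": url, "body": resp.read()})
        except OSError:
            samples.append({"url": url, "body": None})  # unreachable site
    return samples

print(batches(["u1", "u2", "u3", "u4", "u5"], 2))  # [['u1', 'u2'], ['u3', 'u4'], ['u5']]
```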
With continued reference to fig. 4, there are two audit modes for the response data: AI audit and manual audit. In the embodiment of the application, a deep learning model pre-trained on a historical data set is obtained; for example, the response data in the Hive database is modeled in advance using deep learning technology to obtain the deep learning model. If the deep learning model finds illegal content with high similarity (that is, the similarity between the content in the current response data and known illegal content is computed, and the similarity is deemed high if it exceeds a threshold), the current domain name is marked as illegal to obtain the illegal content identification result; this process is fully automatic.
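The similarity-threshold decision can be illustrated with cosine similarity over feature vectors; the vectors below are toy stand-ins for whatever embeddings the deep learning model would actually produce, and the threshold value is illustrative:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def is_illegal(sample_vec, known_illegal_vecs, threshold=0.9):
    """Mark the sample as illegal if it is similar enough to any known illegal content."""
    return any(cosine_similarity(sample_vec, v) > threshold
               for v in known_illegal_vecs)

known = [[1.0, 0.0, 1.0]]
print(is_illegal([0.9, 0.1, 1.1], known))  # True: close to a known-illegal vector
print(is_illegal([0.0, 1.0, 0.0], known))  # False: similarity is 0
```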
If the response data is in a format such as a picture or byte code, text extraction may be performed before model verification, for example by applying OCR (Optical Character Recognition) to the rendered HTML file; the extracted text is then input into the deep learning model to obtain the identification result.
Compared with AI audit, manual audit is more accurate but costly and inefficient. Manual audit means that a person judges the response data and produces the identification result. To improve efficiency, the embodiment of the application aggregates multiple web pages belonging to different websites into one window for display, so that multiple web pages can be checked at the same time to obtain the illegal content identification result.
Referring to fig. 4, the manual audit differs according to the response data. If the HTTP response data is an HTML file that is complete and can be executed normally by a browser, i.e. a renderable HTML file, the response data is aggregated in batches into one window and displayed for manual review; the reviewer can judge subjectively whether each web page contains illegal content, obtain the illegal content identification result, and mark the web page when it does.
If the HTTP response data contains a JSON (JavaScript Object Notation) character string or the byte codes of a picture or video, the content is input into an auxiliary program to determine whether it can be parsed. For example, a regular-expression and special-character matching automaton is used to quickly search for sensitive words, and a byte-code injection or reading program is used to render video streams or pictures, so that the content becomes visible to the human eye and a subjective judgment can be made.
If the auxiliary program cannot parse the data, the potential illegal website is deleted from the result set (namely, the potential illegal website result set). If it can, the auxiliary program parses the response data into a renderable HTML file for manual review to obtain the illegal content identification result.
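The sensitive-word search over a JSON response might be sketched as below; the word list and JSON layout are assumptions for illustration (a real system could compile the word list into an Aho-Corasick automaton for speed, as the "matching automaton" above suggests):

```python
import json
import re

def find_sensitive_words(json_text, sensitive_words):
    """Parse a JSON response, flatten its string values, and report
    which sensitive words occur anywhere in the text."""
    def strings(node):
        if isinstance(node, str):
            yield node
        elif isinstance(node, dict):
            for v in node.values():
                yield from strings(v)
        elif isinstance(node, list):
            for v in node:
                yield from strings(v)

    text = " ".join(strings(json.loads(json_text)))
    pattern = re.compile("|".join(re.escape(w) for w in sensitive_words))
    return sorted(set(pattern.findall(text)))

payload = '{"title": "win big money", "items": [{"desc": "free casino chips"}]}'
print(find_sensitive_words(payload, ["casino", "lottery"]))  # ['casino']
```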
In practical applications, after the identification result is determined by AI audit or manual audit, information such as the domain name and IP address of an illegal website containing illegal content can be extracted, and the extracted content used as part of the historical data set for optimization training of the deep learning model, improving its processing speed and accuracy.
In this way, further verification and confirmation of the potential illegal websites is completed. For websites confirmed to contain illegal content, the related information can be submitted to the corresponding cloud host, operator, and the like for service freezing; or submitted in batches as evidence to the network supervision department to facilitate reporting, inspection, and shutdown. The result can also be communicated to the website's risk control center so that access requests for illegal content are intercepted in time, improving the risk control effect.
Based on the same technical concept as the foregoing illegal content identification method, an embodiment of the present application further provides an illegal content identification apparatus. Fig. 5 shows a block diagram of the illegal content identification apparatus according to an embodiment of the present application. Referring to fig. 5, the illegal content identification apparatus 500 of the embodiment includes:
a sample unit 510, configured to obtain a target sample user set according to a honeypot visit record of a user visiting a honeypot, where the honeypot is a bait generated based on web content of interest to the user and used for attracting the user to visit;
a statistical analysis unit 520, configured to obtain to-be-analyzed access records of each user in the target sample user set, perform statistical analysis on the obtained to-be-analyzed access records, and determine a potential illegal website;
an identifying unit 530, configured to send an access request to the determined potential illegal website, and determine whether the potential illegal website contains illegal content according to the response data returned by the potential illegal website.
In this embodiment, the sample unit 510 is specifically configured to judge the current honeypot access record of a user accessing the honeypot, and, if the access request corresponding to the current honeypot access record is a crawler or malicious request, determine the current honeypot access record as an invalid sample and delete it.
In the embodiment of the present application, the sample unit 510 is configured to make a web page containing the honeypot and put the web page on the network to attract users to visit; when a user visits the web page, execute a JS script preset in the web page to access the target URL of the honeypot, add a timestamp to the target URL, and generate a honeypot access record containing the target URL; and aggregate the honeypot access records that access the same target URL to obtain the target sample user set.
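The record generation and aggregation performed by the sample unit might be sketched as follows; the record fields and the way the timestamp is appended to the URL are assumptions for illustration (the embodiment does not specify the record format):

```python
import time
from collections import defaultdict

def make_honeypot_record(user_id, target_url, now=None):
    """Create a honeypot access record: the target URL with a timestamp appended."""
    ts = int(now if now is not None else time.time())
    return {"user": user_id,
            "target_url": target_url,
            "url_with_ts": f"{target_url}?t={ts}"}

def aggregate_by_target(records):
    """Aggregate honeypot access records that hit the same target URL;
    the users behind each target form the target sample user set."""
    groups = defaultdict(set)
    for rec in records:
        groups[rec["target_url"]].add(rec["user"])
    return dict(groups)

records = [
    make_honeypot_record("u1", "http://honeypot.example/bait", now=1),
    make_honeypot_record("u2", "http://honeypot.example/bait", now=2),
]
print(aggregate_by_target(records))  # both users grouped under the same target URL
```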
In this embodiment of the present application, the sample unit 510 is configured to create a text file robots.txt and declare in it that the web page is allowed to be crawled by search engines, so that the web page is mixed into search engine results; or to deliver the web page to the network in the form of online advertisements according to a preset delivery time period.
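A minimal robots.txt declaring the site crawlable by all search engines might look like this (the layout is illustrative; paths would be adapted to where the honeypot page is hosted):

```text
User-agent: *
Allow: /
```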
In the embodiment of the present application, the statistical analysis unit 520 is specifically configured to obtain the to-be-analyzed access records of each user in the target sample user set by using a big data batch processing algorithm; aggregate the obtained access records by domain name and count the access frequency of each domain name; and sort the domain names by access frequency from high to low, extract a preset number of top domain names as high-frequency domain names, and determine the potential illegal websites from the high-frequency domain names.
In this embodiment, the statistical analysis unit 520 is configured to match each high-frequency domain name with a domain name in a preset white list, filter the high-frequency domain name if the matching is successful, and determine a potential illegal website by using the filtered high-frequency domain name.
In this embodiment of the application, the identifying unit 530 is specifically configured to input the response data into a deep learning model obtained by pre-training according to a historical data set, and if the deep learning model finds that an illegal content whose similarity to a known illegal content is greater than a threshold exists in the response data, output an identifying result that a potential illegal website contains the illegal content; or aggregating the response data in batches into a window for manual review to obtain the identification result of whether the potential illegal website contains illegal contents.
It should be noted that, for the specific implementation of the above device embodiment, reference may be made to the specific implementation of the corresponding method embodiment, which is not described herein again.
In summary, according to the technical scheme of illegal content identification, a target sample user set is obtained from the honeypot access records of users accessing honeypots; statistical analysis is then performed on the to-be-analyzed access records of the users in the target sample user set to determine potential illegal websites; access requests are sent to the potential illegal websites, and whether they contain illegal content is determined based on the response data they return. Illegal content on the deep web is thereby identified actively, direct sources or indirect outlets of high-risk information are determined, and a basic guarantee is provided for subsequent illegal content governance and website risk prevention and control.
It should be noted that:
the algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose devices may be used with the teachings herein. The required structure for constructing such a device will be apparent from the description above. In addition, this application is not directed to any particular programming language. It will be appreciated that a variety of programming languages may be used to implement the teachings of the present application as described herein, and any descriptions of specific languages are provided above to disclose the best modes of the present application.
In the description provided herein, numerous specific details are set forth. However, it is understood that embodiments of the application may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the application, various features of the application are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed application requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of this application.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components. All of the features disclosed in this specification (including any accompanying claims, abstract and drawings), and all of the processes or elements of any method or apparatus so disclosed, may be combined in any combination, except combinations where at least some of such features and/or processes or elements are mutually exclusive. Each feature disclosed in this specification (including any accompanying claims, abstract and drawings) may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise.
Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the application and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. It will be appreciated by those skilled in the art that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the illegal content recognition device according to the embodiments of the present application. The present application may also be embodied as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing the present application may be stored on a computer readable medium or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
For example, fig. 6 shows a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device 600 comprises a processor 610 and a memory 620 arranged to store computer executable instructions (computer readable program code). The memory 620 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. The memory 620 has a storage space 630 storing computer readable program code 631 for performing any of the method steps described above. For example, the memory space 630 for storing the computer readable program code may comprise respective computer readable program codes 631 for respectively implementing the various steps in the above method. The computer readable program code 631 may be read from or written to one or more computer program products. These computer program products comprise a program code carrier such as a hard disk, a Compact Disc (CD), a memory card or a floppy disk. Such a computer program product is typically a computer readable storage medium such as described in fig. 7. FIG. 7 shows a schematic diagram of a computer-readable storage medium according to an embodiment of the present application. The computer readable storage medium 700, in which a computer readable program code 631 for performing the method steps according to the application is stored, is readable by the processor 610 of the electronic device 600, which computer readable program code 631, when executed by the electronic device 600, causes the electronic device 600 to perform the respective steps of the method described above, in particular the computer readable program code 631 stored by the computer readable storage medium may perform the method shown in any of the embodiments described above. The computer readable program code 631 may be compressed in a suitable form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the application, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The application may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names.

Claims (10)

1. An illegal content identification method, comprising:
obtaining a target sample user set according to honeypot access records of users accessing honeypots, wherein the honeypots are baits which are generated based on webpage contents interested by the users and are used for attracting the users to access;
acquiring to-be-analyzed access records of all users in the target sample user set, and performing statistical analysis on the acquired to-be-analyzed access records to determine a potential illegal website;
and sending an access request to the determined potential illegal website, and determining whether the potential illegal website contains illegal contents or not according to response data returned by the potential illegal website.
2. The method of claim 1, wherein the deriving a target sample user set from honeypot visit records of users visiting honeypots comprises:
judging the current honeypot access record of the user accessing the honeypot,
and if the access request corresponding to the current honeypot access record is a crawler or malicious request, determining the current honeypot access record as an invalid sample and deleting the invalid sample.
3. The method of claim 1, wherein the deriving a target sample user set from honeypot visit records of users visiting honeypots comprises:
making a webpage containing honeypots, and putting the webpage into a network to attract users to visit;
when the webpage is visited by a user, executing a JS script preset in the webpage to visit a target URL of the honeypot, adding a timestamp to the target URL and generating a honeypot visit record containing the target URL;
and aggregating the honeypot access records accessing the same target URL to obtain a target sample user set.
4. The method of claim 3, wherein the posting the web page to a network to attract users to access comprises:
creating a text file robots.txt, and declaring that the webpage is allowed to be crawled by a search engine in the text file robots.txt so as to mix the webpage into a search result of the search engine;
or, the webpage is delivered to the network in the form of online advertisements according to a preset delivery time period.
5. The method of claim 1, wherein obtaining the access records to be analyzed of each user in the target sample user set, performing statistical analysis on the obtained access records to be analyzed, and determining potential illegal websites comprises:
obtaining access records to be analyzed of all users in the target sample user set by using a big data batch processing algorithm;
aggregating the obtained access records to be analyzed according to domain names, and counting the access frequency of each domain name;
and sorting the domain names from high to low according to the access frequency, extracting a plurality of domain names preset in the front as high-frequency domain names, and determining potential illegal websites by the high-frequency domain names.
6. The method of claim 5, wherein the determining potential illegal websites by the high-frequency domain names comprises:
and matching each high-frequency domain name with the domain names in the preset white list, filtering the high-frequency domain names if the matching is successful, and determining the potential illegal website according to the filtered high-frequency domain names.
7. The method of any one of claims 1-6, wherein determining whether the potential illegal website contains illegal content according to response data returned by the potential illegal website comprises:
inputting the response data into a deep learning model obtained by pre-training according to a historical data set, and if the deep learning model finds that illegal contents with the similarity greater than a threshold value with known illegal contents exist in the response data, outputting an identification result that a potential illegal website contains the illegal contents;
or aggregating the response data in batches into a window for manual review to obtain the identification result of whether the potential illegal website contains illegal contents.
8. An illegal content recognition device, comprising:
the system comprises a sample unit, a target sample user set and a processing unit, wherein the sample unit is used for obtaining a target sample user set according to honeypot access records of honeypot access of users, and the honeypot is a bait which is generated based on webpage contents interested by the users and is used for attracting the users to access;
The statistical analysis unit is used for acquiring the access records to be analyzed of all the users in the target sample user set, performing statistical analysis on the acquired access records to be analyzed and determining potential illegal websites;
and the identification unit is used for sending the access request to the determined potential illegal website and determining whether the potential illegal website contains illegal contents or not according to response data returned by the potential illegal website.
9. An electronic device, comprising: a processor; and a memory arranged to store computer-executable instructions that, when executed, cause the processor to perform the method of any one of claims 1-7.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores one or more programs which, when executed by a processor, implement the method of any of claims 1-7.
CN202010494955.2A 2020-06-03 2020-06-03 Illegal content identification method and device, electronic equipment and storage medium Pending CN111859234A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010494955.2A CN111859234A (en) 2020-06-03 2020-06-03 Illegal content identification method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN111859234A true CN111859234A (en) 2020-10-30

Family

ID=72985438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010494955.2A Pending CN111859234A (en) 2020-06-03 2020-06-03 Illegal content identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111859234A (en)


Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103795748A (en) * 2012-10-30 2014-05-14 工业和信息化部电信传输研究所 Method for downloading mobile internet website content information
US20140279614A1 (en) * 2013-03-14 2014-09-18 Wayne D. Lonstein Methods and systems for detecting, preventing and monietizing attempted unauthorized use and unauthorized use of media content
CN110324313A (en) * 2019-05-23 2019-10-11 平安科技(深圳)有限公司 The recognition methods of malicious user based on honey pot system and relevant device
CN110336811A (en) * 2019-06-29 2019-10-15 上海淇馥信息技术有限公司 A kind of Cyberthreat analysis method, device and electronic equipment based on honey pot system
CN110619075A (en) * 2018-06-04 2019-12-27 阿里巴巴集团控股有限公司 Webpage identification method and equipment


Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112733057A (en) * 2020-11-27 2021-04-30 杭州安恒信息安全技术有限公司 Network content security detection method, electronic device and storage medium
CN112634090A (en) * 2020-12-15 2021-04-09 深圳市彬讯科技有限公司 Home decoration information reporting management method, system, computer device and storage medium
CN113204695A (en) * 2021-05-12 2021-08-03 北京百度网讯科技有限公司 Website identification method and device
CN113204695B (en) * 2021-05-12 2023-09-26 北京百度网讯科技有限公司 Website identification method and device
CN113505317A (en) * 2021-06-15 2021-10-15 山东伏羲智库互联网研究院 Illegal advertisement identification method and device, electronic equipment and storage medium
CN113505287A (en) * 2021-06-24 2021-10-15 微梦创科网络科技(中国)有限公司 Website link detection method and system
CN113852611A (en) * 2021-09-09 2021-12-28 上海理想信息产业(集团)有限公司 IP (Internet protocol) drainage method of website interception platform, computer equipment and storage medium
CN113852611B (en) * 2021-09-09 2023-05-09 上海理想信息产业(集团)有限公司 IP drainage method of website interception platform, computer equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111859234A (en) Illegal content identification method and device, electronic equipment and storage medium
CN104766014B (en) Method and system for detecting malicious website
US8972401B2 (en) Search spam analysis and detection
US10212175B2 (en) Attracting and analyzing spam postings
US9430577B2 (en) Search ranger system and double-funnel model for search spam analyses and browser protection
CN101971591B (en) System and method of analyzing web addresses
US10547691B2 (en) System and method for main page identification in web decoding
US8667117B2 (en) Search ranger system and double-funnel model for search spam analyses and browser protection
CN103401835A (en) Method and device for presenting safety detection results of microblog page
CN108334641B (en) Method, system, electronic equipment and storage medium for collecting user behavior data
WO2011041465A1 (en) Enhanced website tracking system and method
CN103218431A (en) System and method for identifying and automatically acquiring webpage information
CN114422211A (en) HTTP malicious traffic detection method and device based on graph attention network
Koide et al. To get lost is to learn the way: Automatically collecting multi-step social engineering attacks on the web
CN103440454B (en) A kind of active honeypot detection method based on search engine keywords
CN115280305A (en) Heterogeneous graph clustering using inter-point mutual information criterion
CN117221135A (en) Data analysis method, device, equipment and computer readable storage medium
CN108804501A (en) A kind of method and device of detection effective information
AU2013221949C1 (en) Online content collection
CN116318974A (en) Site risk identification method and device, computer readable medium and electronic equipment
Dennis et al. Data mining approach for user profile generation on advertisement serving
US8909795B2 (en) Method for determining validity of command and system thereof
Di Tizio et al. A calculus of tracking: Theory and practice
CN108804444B (en) Information capturing method and device
CN108664489B (en) Website content monitoring method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20201030