CN114329287A - Abnormal link processing method and device, computer equipment and storage medium - Google Patents

Abnormal link processing method and device, computer equipment and storage medium

Info

Publication number
CN114329287A
Authority
CN
China
Prior art keywords
content
link
abnormal
search
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111242832.0A
Other languages
Chinese (zh)
Inventor
唐亚腾
谢锦汉
钟滨
徐进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111242832.0A priority Critical patent/CN114329287A/en
Publication of CN114329287A publication Critical patent/CN114329287A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application disclose an abnormal link processing method and device, computer equipment, and a storage medium. The method comprises: acquiring a search link to be detected; performing a content search based on the search link to obtain a web page structure corresponding to each search link; parsing the web page structure to obtain description information of the search link on at least one content dimension; for the description information of each content dimension, performing anomaly detection on the search link using a corresponding abnormal link detection strategy to obtain an anomaly detection result for each content dimension; and blocking the search link based on the anomaly detection result, thereby improving the accuracy of abnormal link processing.

Description

Abnormal link processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular to an abnormal link processing method and device, computer equipment, and a storage medium.
Background
An abnormal link is a link that exhibits an abnormal condition. For example, an abnormal link may be a link whose server address has changed, so that the corresponding address can no longer be found. As another example, an abnormal link may be a link that cannot properly display the corresponding web page content. When a user searches for content with a search engine, too many abnormal links in the search results degrade the search quality and, in turn, the user's experience of the search engine. In practicing the prior art, the inventors of the present application found that existing methods for processing abnormal links suffer from low accuracy.
Disclosure of Invention
The embodiment of the application provides an abnormal link processing method, an abnormal link processing device, computer equipment and a storage medium, and can improve the accuracy of processing abnormal links.
The embodiment of the application provides an abnormal link processing method, which comprises the following steps:
acquiring a search link to be detected;
searching content based on the search links to obtain a webpage structure corresponding to each search link;
analyzing the webpage structure to obtain the description information of the search link on at least one content dimension;
aiming at the description information of each content dimension, carrying out anomaly detection on the search link by adopting a corresponding anomaly link detection strategy to obtain an anomaly detection result of each content dimension;
and carrying out blocking processing on the search link based on the abnormal detection result.
Correspondingly, an embodiment of the present application further provides an abnormal link processing apparatus, including:
the acquisition unit is used for acquiring a search link to be detected;
the content searching unit is used for searching contents based on the searching links to obtain a webpage structure corresponding to each searching link;
the analysis unit is used for analyzing the webpage structure to obtain the description information of the search link on at least one content dimension;
the anomaly detection unit is used for carrying out anomaly detection on the search link by adopting a corresponding anomaly link detection strategy aiming at the description information of each content dimension to obtain an anomaly detection result of each content dimension;
and the forbidding unit is used for carrying out forbidding processing on the search link based on the abnormal detection result.
In one embodiment, the abnormality detection unit includes:
a filtering content detection subunit, configured to perform filtering content detection on the original web page content;
the first analysis subunit is used for analyzing the original webpage content to obtain main content in the original webpage content when the original webpage content does not include preset filtering content;
and the content detection subunit is used for carrying out content detection on the main content to obtain the abnormal detection result.
In one embodiment, the content detection subunit includes:
the text detection module is used for performing text detection on the main content;
and the abnormal keyword detection module is used for detecting abnormal keywords of the text content to obtain an abnormal detection result when the main content is detected to comprise the text content.
In one embodiment, the abnormal keyword detection module includes:
the word segmentation sub-module is used for carrying out word segmentation processing on the text content to obtain at least one text sub-word;
the keyword matching sub-module is used for matching the text sub-words with preset abnormal keywords to obtain keyword matching results;
and the result generation submodule is used for generating the abnormal detection result based on the keyword matching result.
In one embodiment, the result generation submodule is configured to:
when the text sub-words are not matched with the preset abnormal keywords, performing semantic extraction on the text content to obtain semantic features of the text content;
respectively carrying out forward coding and backward coding on the semantic features to obtain forward coding information corresponding to the forward coding and backward coding information corresponding to the backward coding;
fusing the forward coding information and the backward coding information to obtain fused coding information;
and calculating the abnormal probability of the search link based on the fused coding information to obtain the abnormal detection result.
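The forward coding, backward coding, fusion, and probability steps above can be illustrated with a toy bidirectional recurrent encoder. This is only a sketch: the text does not specify the encoder, so the element-wise tanh recurrence, the fixed weights `w_in` and `w_rec`, fusion by concatenation, and the logistic output layer are all assumptions made for illustration.

```python
import math


def encode(sequence, w_in=0.5, w_rec=0.3):
    """Run a toy element-wise recurrent pass and return the final hidden state."""
    hidden = [0.0] * len(sequence[0])
    for vector in sequence:
        hidden = [math.tanh(w_in * x + w_rec * h) for x, h in zip(vector, hidden)]
    return hidden


def anomaly_probability(sequence, weights=None, bias=0.0):
    """Fuse forward and backward encodings and map them to a probability."""
    forward = encode(sequence)                    # forward coding information
    backward = encode(list(reversed(sequence)))   # backward coding information
    fused = forward + backward                    # fusion by concatenation (assumed)
    if weights is None:
        weights = [1.0] * len(fused)              # hypothetical output weights
    score = sum(w * f for w, f in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-score))         # logistic output (assumed)
```

In a real system the semantic features and all weights would come from a trained model; here `sequence` is simply a list of equal-length feature vectors, one per text sub-word.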
In an embodiment, the abnormal keyword detection module further includes:
the keyword acquisition submodule is used for acquiring initial abnormal keywords;
the expansion submodule is used for performing expansion processing on the initial abnormal keywords to obtain expanded abnormal keywords;
the abnormal link searching submodule is used for performing abnormal link searching based on the expanded abnormal keywords to obtain an abnormal link searching result;
and the screening submodule is used for screening the preset abnormal key words from the expanded abnormal key words based on the abnormal link search result.
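One possible reading of the expansion and screening steps is sketched below. The synonym table, the labelled sample pages standing in for the abnormal link search, and the `min_precision` threshold are hypothetical; the text does not say how expansion or screening is carried out.

```python
def expand_keywords(initial, synonyms):
    """Expand the initial abnormal keywords with entries from a synonym table."""
    expanded = set(initial)
    for keyword in initial:
        expanded.update(synonyms.get(keyword, []))
    return expanded


def screen_keywords(expanded, pages, min_precision=0.8):
    """Keep keywords whose matching pages are mostly abnormal.

    pages is a list of (text, is_abnormal) pairs that stands in for the
    result of searching abnormal links with each expanded keyword.
    """
    kept = set()
    for keyword in expanded:
        hits = [is_abnormal for text, is_abnormal in pages if keyword in text]
        if hits and sum(hits) / len(hits) >= min_precision:
            kept.add(keyword)
    return kept
```

With this design a broad keyword that also matches many normal pages is dropped, which keeps the preset abnormal keywords precise.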
In an embodiment, the content detection subunit further includes:
the image detection module is used for carrying out image detection on the main content when the main content does not comprise text content;
the character recognition module is used for carrying out character recognition on the image content when the main body content is detected to comprise the image content;
and the abnormal character detection module is used for detecting abnormal characters of the character information to obtain an abnormal detection result when the character information of the image content is identified.
In an embodiment, the content detection subunit further includes:
the clustering module is used for clustering the original webpage content to obtain a target cluster corresponding to the original webpage content when the main content is detected not to include the image content;
the first calculation module is used for calculating the similarity between the objects in the target cluster;
and the judging module is used for judging whether the search link is an abnormal link or not based on the similarity.
In one embodiment, the abnormality detection unit includes:
the second analysis subunit is used for analyzing the link content to obtain domain names of the link content on different levels;
the clustering processing subunit is configured to perform clustering processing on the link content based on domain names of the link content at different levels to obtain a target domain name cluster corresponding to the link content;
and the similarity judging subunit is used for judging the similarity of the target domain name cluster and generating the abnormal detection result based on the judgment result.
In an embodiment, the cluster processing subunit includes:
the second calculation module is used for calculating the distance between the domain name and a preset domain name in a plurality of preset domain name clusters;
and the determining module is used for determining, based on the distance, a target domain name cluster for the domain name from the plurality of preset domain name clusters.
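The domain name parsing and distance-based cluster assignment described above might look like the following sketch. The distance metric is not specified in the text; Levenshtein edit distance is used here as an assumption, and the cluster names and preset domains are hypothetical.

```python
from urllib.parse import urlsplit


def domain_levels(url):
    """Return the domain name levels of a URL, top-level domain first."""
    host = urlsplit(url).hostname or ""
    return host.split(".")[::-1]


def edit_distance(a, b):
    """Levenshtein distance between two strings (assumed distance metric)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def assign_cluster(domain, preset_clusters):
    """Pick the preset cluster containing the preset domain nearest to `domain`."""
    return min(
        preset_clusters,
        key=lambda name: min(edit_distance(domain, d) for d in preset_clusters[name]),
    )
```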
In one embodiment, the abnormality detection unit includes:
the state code matching subunit is used for matching the link state code with a preset abnormal state code to obtain a state code matching result;
and the mapping subunit is used for mapping the matching result to the corresponding abnormal detection result.
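A minimal sketch of the status code matching and mapping, assuming the preset abnormal status codes are 403, 404, and 503 (the codes the description gives as common protocol dead link states). The result labels are hypothetical.

```python
ABNORMAL_STATUS_CODES = {403, 404, 503}  # assumed preset abnormal status codes


def detect_by_status_code(status_code, abnormal_codes=ABNORMAL_STATUS_CODES):
    """Match the link status code and map the match to a detection result."""
    matched = status_code in abnormal_codes
    # A non-matching code is not proof of a normal link: a content dead link
    # can still return a normal status, so the result stays undetermined.
    return "abnormal" if matched else "undetermined"
```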
Embodiments of the present application also provide a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. The computer instructions are read by a processor of a computer device from a computer-readable storage medium, and the computer instructions are executed by the processor to cause the computer device to perform the method provided in the various alternatives of the above aspect.
Correspondingly, an embodiment of the present application further provides a storage medium, where the storage medium stores instructions, and the instructions, when executed by a processor, implement any one of the abnormal link processing methods provided in the embodiments of the present application.
In the embodiments of the present application, a search link to be detected can be acquired; a content search is performed based on the search link to obtain a web page structure corresponding to each search link; the web page structure is parsed to obtain description information of the search link on at least one content dimension; for the description information of each content dimension, anomaly detection is performed on the search link using a corresponding abnormal link detection strategy to obtain an anomaly detection result for each content dimension; and the search link is blocked based on the anomaly detection result, thereby improving the accuracy of abnormal link processing.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic view of a scenario of an abnormal link processing method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of an abnormal link processing method according to an embodiment of the present application;
FIG. 3 is a schematic view of a scene of a web page structure provided in an embodiment of the present application;
FIG. 4 is a schematic view of a scene of image content provided by an embodiment of the present application;
FIG. 5 is a schematic flowchart of an abnormal link processing method according to an embodiment of the present application;
FIG. 6 is a schematic diagram of another scenario of an abnormal link processing method according to an embodiment of the present application;
FIG. 7 is a schematic structural diagram of an abnormal link processing apparatus according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, however, the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The embodiment of the application provides an abnormal link processing method, which can be executed by an abnormal link processing device, and the abnormal link processing device can be integrated in computer equipment. Wherein the computer device may comprise at least one of a terminal and a server, etc. That is, the abnormal link processing method proposed in the embodiment of the present application may be executed by a terminal, may be executed by a server, or may be executed by both a terminal and a server capable of communicating with each other.
The terminal can be a smart phone, a tablet Computer, a notebook Computer, a Personal Computer (PC), a smart home, a wearable electronic device, a VR/AR device, a vehicle-mounted Computer, and the like. The server may be an interworking server or a background server among a plurality of heterogeneous systems, an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, a cloud server providing basic cloud computing services such as cloud service, a cloud database, cloud computing, cloud functions, cloud storage, network service, cloud communication, middleware service, domain name service, security service, big data and artificial intelligence platforms, and the like.
In an embodiment, as shown in FIG. 1, the abnormal link processing apparatus may be integrated on a computer device such as a terminal or a server, so as to implement the abnormal link processing method provided in the embodiments of the present application. Specifically, the computer device may acquire a search link to be detected; perform a content search based on the search link to obtain a web page structure corresponding to each search link; parse the web page structure to obtain description information of the search link on at least one content dimension; for the description information of each content dimension, perform anomaly detection on the search link using a corresponding abnormal link detection strategy to obtain an anomaly detection result for each content dimension; and block the search link based on the anomaly detection result.
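The overall flow can be sketched as a small pipeline. The function and parameter names (`process_links`, `fetch`, `parse`, `detect`, `block`) are hypothetical, and blocking a link as soon as any content dimension is flagged is an assumption; the text only states that blocking is based on the anomaly detection results.

```python
def process_links(links, fetch, parse, detect, block):
    """Run acquire -> content search -> parse -> detect -> block for each link.

    fetch(link)         returns the web page structure for a link
    parse(structure)    returns description information per content dimension
    detect(description) returns {dimension: True if abnormal, else False}
    block(link)         blocks an abnormal link
    """
    results = {}
    for link in links:
        structure = fetch(link)
        description = parse(structure)
        verdicts = detect(description)
        results[link] = verdicts
        if any(verdicts.values()):  # assumed rule: one abnormal dimension suffices
            block(link)
    return results
```

Passing the four stages in as callables keeps the sketch independent of any particular crawler, parser, or detection strategy.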
Details are given below. Note that the order in which the following embodiments are described does not limit their order of preference.
The embodiment of the present application will be described from the perspective of an abnormal link processing apparatus, which may be integrated in a computer device, where the computer device may be a server or a terminal.
As shown in FIG. 2, an abnormal link processing method is provided, and the specific flow includes:
101. Acquire a search link to be detected.
With the development of information technology and internet technology, people around the world are no longer split and isolated, but are integrated through the information technology and the internet technology. For example, through internet technology, people can search for information around the world. When the user searches for the data through the computer device, if too many abnormal links exist in the returned links, the use experience of the user is affected.
In an embodiment, the method for processing the abnormal link provided by the embodiment of the present application may be embedded in a search engine, and when a user searches for content through the search engine, the method for processing the abnormal link provided by the embodiment of the present application may be executed.
For example, when a user searches for material via a computer device, the computer device may receive a search trigger instruction and retrieve at least one search link based on the search trigger instruction.
Wherein a search link may include a connection from a web page to a target. For example, the search link may be a hyperlink, or the like.
In one embodiment, the search links have multiple presentation forms. For example, the search link may be a piece of text or a picture, and so on. When the user clicks the linked text or image, the current page jumps to the target page connected with the search link.
In one embodiment, abnormal links may exist among the search links obtained by searching. An abnormal link is a link that exhibits an abnormal condition. For example, an abnormal link may be a link whose server address has changed, so that the corresponding address can no longer be found. As another example, an abnormal link may be a link that cannot properly display the corresponding web page content, and so on.
For example, the abnormal link may include a dead link. A dead link is a link whose server address has changed, so that the resource at the current address can no longer be found. Dead links take two forms: protocol dead links and content dead links. A protocol dead link is a dead link that is explicitly indicated by the TCP/HTTP protocol status of the page, such as the common 404, 403, and 503 statuses. A content dead link is a page for which the server returns a normal status, but whose content has been replaced by an information page unrelated to the original content, for example one stating that the content does not exist, has been deleted, or requires permission.
When at least one searched link obtained by searching has an abnormal link, if the user clicks the abnormal link but the displayed content is not the corresponding content, the experience of the user in searching the content is affected. For example, a user searches for material using a search website and the search results display several search links. If the abnormal links exist in the plurality of search links, the experience of the user in using the search website is affected. As another example, a user searches for material using the search function of certain social software, and the search results display several search links. If abnormal links exist in the plurality of search links, the experience of the user using the search function in the social software is influenced.
In a search scenario, some of the search results a user obtains are invalid or have expired. Identifying and cleaning up these results requires the platform to be able to identify the content behind a web page link. Dead link identification is an important foundation for the platform: the ability to identify and filter dead link content directly affects the ecosystem of the whole platform and the user experience. In dead link detection the platform must handle various abnormal and special cases, such as a crawled result that differs from the real page, a target page that consists of a single image with no text, or crawling that is blocked by the target website.
Therefore, the abnormal link processing method provided by the embodiment of the application can effectively detect the abnormal link in the search link, thereby improving the accuracy of processing the abnormal link and improving the search experience of the user.
102. Perform a content search based on the search link to obtain a web page structure corresponding to the search link.
In an embodiment, after the search link is acquired, a content search may be performed based on the search link to obtain the web page structure corresponding to the web page content of each search link.
The content search may include the process of acquiring the web page structure of a search link.
The web page structure may include information describing the content of the web page and how that content is laid out in the page. Through the web page structure, the abnormal link processing device can learn what web page content corresponds to the search link and how that content is laid out in the web page.
In one embodiment, the web page structure may be represented in a variety of ways. For example, the web page structure may be represented in HyperText Markup Language (HTML). As another example, it may be represented in Extensible HyperText Markup Language (XHTML), and so on.
For example, a web page structure represented in HTML may be as shown in FIG. 3, where 001 is a schematic example of the web page structure corresponding to a search link.
In one embodiment, the web pages mentioned in the embodiments of the present application may include web pages on various devices, for example web pages displayed on a computer, a mobile phone, a tablet, or various Internet of Things devices.
In one embodiment, there are multiple ways to perform content search based on the search links, and obtain the web page structure corresponding to each search link.
For example, the web page structure corresponding to the web page content of the search link may be crawled with a crawler.
As another example, the web page structure corresponding to the web page content of the search link may be acquired through the developer mode of the web page.
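A crawler-style fetch can be sketched with Python's standard `urllib`. The function name `fetch_structure`, the returned dictionary layout, and the `User-Agent` value are assumptions for illustration; a production crawler would also need retries, rate limiting, and handling of anti-crawling measures.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def fetch_structure(url, opener=urlopen, timeout=10):
    """Fetch a page and return its URL, HTTP status code, and HTML text."""
    request = Request(url, headers={"User-Agent": "link-checker-sketch"})
    try:
        with opener(request, timeout=timeout) as response:
            html = response.read().decode("utf-8", errors="replace")
            return {"url": url, "status": response.status, "html": html}
    except HTTPError as err:
        # The server answered with an error status such as 404 or 503.
        return {"url": url, "status": err.code, "html": ""}
    except URLError:
        # No HTTP response at all (for example, DNS failure or refused connection).
        return {"url": url, "status": None, "html": ""}
```

The `opener` parameter defaults to `urlopen` and exists only so the function can be exercised without network access.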
In an embodiment, a large number of exposure logs and click logs may be obtained; after the web page structure content is crawled through a crawler platform, external features of that content are computed. The external features may represent description information of the web page structure content.
103. Parse the web page structure to obtain description information of the search link on at least one content dimension.
In an embodiment, after the web page structure corresponding to the search link is obtained, the web page structure may be parsed to obtain the content corresponding to the search link.
For example, the information in the web page structure may be the content corresponding to the search link. When the user clicks the search link, the information in the web page structure is displayed as web page content. For example, as shown in FIG. 3, when the web page structure is 001 in FIG. 3, the corresponding displayed information may be 002 in FIG. 3.
In one embodiment, the information in the web page structure may include many kinds of content, so it can be divided into description information on different content dimensions.
The description information may include information indicating whether the search link is likely to be an abnormal link. For example, the description information may include original web page content, link content, and a link status code.
The link status code may include information describing the state of the search link. For example, the link status code may be an HTTP status code, such as a 200, 202, 404, 403, or 503 status code.
The 200 status code indicates that the request for the web page succeeded and that the requested response headers or data body are returned with the response. The 202 status code indicates that the server has accepted the request but has not yet processed it. The 404 status code indicates that the web page or file corresponding to the search link was not found. The 403 status code indicates that the server understood the request but refuses to grant access to the resource. The 503 status code indicates that the server is temporarily unable to handle the request.
In one embodiment, the search link is essentially the address of the web page content, so the web page content can be obtained through the search link. The link content may include the address representation of the search link. For example, the link content may be a web address (URL), such as "https://www.XXX.com" or "https://www.XXX.com/item/xx.459253".
In one embodiment, the original web page content may include content in the web page structure other than the link content and the link status code. For example, the original web page content may include text content, image content, video content, animation content, and table content in the web page, among others.
In an embodiment, the web page structure can be parsed in multiple ways to obtain the description information of the search link on at least one content dimension.
For example, the description information of each content dimension has a corresponding identifier, so the description information of the search link on at least one content dimension can be obtained by recognizing the identifiers carried by the web page structure.
As another example, when the web page structure is represented in HTML, the tags in the HTML can be recognized to obtain the description information of the search link on at least one content dimension.
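Tag-based parsing can be sketched with Python's standard `html.parser`. Which tags map to which content dimension is not stated in the text; collecting visible text and `img` sources as the original web page content, and the dictionary keys used below, are assumptions for illustration.

```python
from html.parser import HTMLParser


class DescriptionParser(HTMLParser):
    """Collect visible text and image sources from an HTML page."""

    def __init__(self):
        super().__init__()
        self.text_parts = []
        self.images = []
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.images.append(src)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def parse_structure(html, url, status):
    """Split a fetched page into description information per content dimension."""
    parser = DescriptionParser()
    parser.feed(html)
    parser.close()
    return {
        "link_content": url,
        "link_status_code": status,
        "original_web_content": {
            "text": " ".join(parser.text_parts),
            "images": parser.images,
        },
    }
```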
In an embodiment, parsing the web page structure may yield description information of the search link on a single content dimension or on multiple content dimensions.
For example, the web page structure is parsed, and the obtained description information may only include the link content. For another example, the web page structure is parsed, and the obtained description information may include only the link content and the link status code. For another example, the web page structure is parsed, and the obtained description information may include only the link content and the original web page content. For another example, the web page structure is parsed, and the obtained description information may include link content, original web page content, and link status code, etc.
104. For the description information of each content dimension, perform anomaly detection on the search link using a corresponding abnormal link detection strategy to obtain an anomaly detection result for each content dimension.
In an embodiment, after obtaining the description information of the search link on at least one content dimension, the search link may be subjected to anomaly detection by using a corresponding anomaly link detection policy with respect to the description information of each content dimension, so as to obtain an anomaly detection result of each content dimension.
The abnormal link detection strategy comprises a rule which needs to be followed when the abnormal detection is carried out on the search link aiming at the description information.
In one embodiment, the description information of different content dimensions corresponds to different abnormal link detection strategies. For example, when the description information includes link content and a link status code, the link content may correspond to one abnormal link detection policy, and the link status code may correspond to another abnormal link detection policy. For another example, when the description information includes link content, a link status code, and original web content, the link content may correspond to the abnormal link detection policy a, the link status code may correspond to the abnormal link detection policy B, and the original web content may correspond to the abnormal link detection policy C.
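The one-strategy-per-dimension correspondence can be sketched as a lookup table from content dimension to strategy function. The dimension names and the example strategies are hypothetical.

```python
def detect(description, strategies):
    """Apply the strategy registered for each content dimension that is present."""
    return {
        dimension: strategies[dimension](info)
        for dimension, info in description.items()
        if dimension in strategies
    }


# Hypothetical strategies standing in for the per-dimension detection strategies.
example_strategies = {
    "link_status_code": lambda code: code in {403, 404, 503},
    "original_web_content": lambda text: "not found" in text.lower(),
}
```

Dimensions without a registered strategy are skipped, so description information covering only some dimensions is handled the same way as a complete one.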
In an embodiment, when the description information is the original web page content, it may first be determined whether the original web page content contains content that needs to be filtered; when it does not, the original web page content may be examined to determine whether the search link is an abnormal link. Specifically, the step of performing anomaly detection on the search link using a corresponding abnormal link detection strategy for the description information of each content dimension, to obtain an anomaly detection result for each content dimension, may include:
performing filtering content detection on original webpage content;
when the original webpage content does not comprise preset filtering content, analyzing the original webpage content to obtain main content in the original webpage content;
and performing content detection on the main content to obtain an abnormal detection result.
When the original web page content contains the filtering content, the original web page content is not examined, and whether the search link is an abnormal link is judged in other ways. For example, empty original web page content has no substantive content; examining it anyway to decide whether the search link is abnormal would increase the error rate of anomaly detection.
Therefore, when the description information is the original webpage content, the original webpage content can be firstly subjected to filtering content detection, and when the original webpage content does not include the preset filtering content, whether the search link is an abnormal link can be judged by detecting the original webpage content. For example, it may be determined whether the original web content is empty content, and when the original web content is empty content, other abnormal link detection strategies are adopted to perform abnormal detection on the search link. And when the original webpage content is not the empty content, whether the search link is an abnormal link can be judged by detecting the original webpage content.
In one embodiment, there are various ways to detect the original web content, so as to determine whether the original web content includes the preset filtering content. For example, the original web page content may be traversed. For another example, when the original web content is described by HTML, the HTML tag in the original web content may be detected, so as to determine whether the web content includes the preset filtering content.
In one embodiment, when the original web page content includes the preset filtering content, other abnormal link detection strategies may be adopted to perform abnormal detection on the search link.
In an embodiment, when the original web content does not include the preset filtering content, it may be determined whether the search link is an abnormal link based on the main content of the original web content. For example, when the original web content is not empty content, it may be determined whether the search link is an abnormal link based on the main content of the original web content. Therefore, when the original webpage content does not include the preset filtering content, the original webpage content can be analyzed to obtain the main content in the original webpage content.
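The branching described above can be sketched as follows. This is only an illustrative sketch: it assumes that the preset filtering content is "empty content", and the function names `has_filter_content` and `choose_detection_path` are hypothetical, not part of the application.

```python
from typing import Optional


def has_filter_content(raw_html: Optional[str]) -> bool:
    """Return True when the page matches the preset filtering content.

    Assumption: the only preset filtering content is an empty page.
    """
    return raw_html is None or raw_html.strip() == ""


def choose_detection_path(raw_html: Optional[str]) -> str:
    """Decide which detection path the original webpage content takes."""
    if has_filter_content(raw_html):
        # Empty pages are not content-checked; other strategies
        # (link content, link status code) are used instead.
        return "other_strategies"
    # Otherwise the main content is extracted and checked.
    return "main_content_detection"
```

A non-empty page such as `"<p>hello</p>"` is routed to main content detection, while an empty string is routed to the other strategies.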
The main content may include content having a substantial meaning in the original web page content. For example, the body content may include text content, image content, video content, animation content, and table content in the original web content, among others.
In one embodiment, when the search link is an abnormal link, the original webpage content corresponding to the search link often contains a related text prompt. For example, when the search link is a dead link, the original webpage content of the search link may contain text such as "sorry, you do not have the resource accessed". Therefore, when content detection is performed on the main content, it may first be determined whether there is text content in the main content. When text content is included in the main content, whether the search link is an abnormal link may be determined based on that text content. Specifically, the step of performing content detection on the main content to obtain the abnormal detection result may include:
performing text detection on the main content;
and when the main content is detected to comprise the text content, carrying out abnormal keyword detection on the text content to obtain an abnormal detection result.
In one embodiment, the purpose of text detection on the main content is to determine whether the main content has text content. There are various ways in which text detection can be performed on the main content.
For example, when the original webpage content is described in HTML, whether text content is included in the main content can be determined by detecting the HTML tags of the main content. For example, when the HTML tags of the main content include a text tag, this indicates that the main content may include text content. Text tags may include tags that act on text. For example, text tags may include HTML tags such as <p>, <em>, <i>, and <h1>.
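As a minimal sketch of tag-based text detection, the following uses Python's standard `html.parser` module. The tag set is limited to the tags named above, and the names `TEXT_TAGS`, `TextTagFinder`, and `extract_text_content` are hypothetical; a real system would likely use a fuller tag list and a more tolerant parser.

```python
from html.parser import HTMLParser

# Assumption: only the text tags named in the description are considered.
TEXT_TAGS = {"p", "em", "i", "h1"}


class TextTagFinder(HTMLParser):
    """Collect the text found inside text tags."""

    def __init__(self):
        super().__init__()
        self._open_text_tags = 0  # how many text tags are currently open
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in TEXT_TAGS:
            self._open_text_tags += 1

    def handle_endtag(self, tag):
        if tag in TEXT_TAGS and self._open_text_tags > 0:
            self._open_text_tags -= 1

    def handle_data(self, data):
        # Keep only non-blank text that sits inside a text tag.
        if self._open_text_tags > 0 and data.strip():
            self.text_parts.append(data.strip())


def extract_text_content(main_content_html):
    """Return the list of text fragments found in the main content."""
    finder = TextTagFinder()
    finder.feed(main_content_html)
    finder.close()
    return finder.text_parts
```

An empty result list means the main content has no text content, in which case image detection can be tried instead.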
In one embodiment, when text content is included in the main content, it may be determined whether the text content has an abnormal keyword. When the text content includes an abnormal keyword, the search link is likely to be an abnormal link. Therefore, when the main content is detected to include text content, abnormal keyword detection can be performed on the text content to obtain an abnormal detection result. Specifically, the step of performing abnormal keyword detection on the text content to obtain an abnormal detection result may include:
performing word segmentation processing on the text content to obtain at least one text sub-word;
matching the text sub-words with preset abnormal keywords to obtain a keyword matching result;
and generating an abnormal detection result based on the keyword matching result.
Where text sub-words may include words that constitute text content. For example, when the text content is "may be a web address with a mistake", the text sub-words may include "may", "is", "web address", and "has a mistake". For another example, when the text content is "the content of the website is deleted", the text sub-words may include "the website", "the content", "by", "delete", and the like.
In one embodiment, there are multiple ways to perform word segmentation on the text content to obtain at least one text sub-word. For example, the text content may be segmented using a word segmentation tool such as Han Language Processing (HanLP), the jieba library, Language Technology Platform (LTP), or QQSeg to obtain at least one text sub-word.
In an embodiment, after obtaining at least one text sub-word, the text sub-word may be matched with a preset abnormal keyword to obtain a keyword matching result, and an abnormal detection result is generated based on the keyword matching result. The preset exception keyword may include a word stored in the exception link processing apparatus in advance.
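The segmentation and matching steps can be sketched as below. To keep the sketch self-contained, a simple whitespace split stands in for a real word segmentation tool such as jieba or HanLP, and the keyword set `PRESET_ABNORMAL_KEYWORDS` is a hypothetical example rather than the list used by the application.

```python
# Hypothetical preset abnormal keywords, for illustration only.
PRESET_ABNORMAL_KEYWORDS = {"deleted", "expired", "404"}

_PUNCTUATION = ".,!?;:\"'"


def segment(text):
    """Split text into sub-words.

    Stand-in for a real word segmentation tool: lowercases the text,
    splits on whitespace, and strips surrounding punctuation.
    """
    words = (w.strip(_PUNCTUATION).lower() for w in text.split())
    return [w for w in words if w]


def match_keywords(text, keywords=PRESET_ABNORMAL_KEYWORDS):
    """Return the text sub-words that match a preset abnormal keyword."""
    return [w for w in segment(text) if w in keywords]


def detect_by_keywords(text, keywords=PRESET_ABNORMAL_KEYWORDS):
    """Return True when the keyword matching result marks the link abnormal."""
    return bool(match_keywords(text, keywords))
```

For Chinese text, `segment` would be replaced by a call into the chosen segmentation tool; the matching logic stays the same.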
In an embodiment, before storing the preset exception keyword in the exception link processing device, the exception link processing device may obtain an initial exception keyword, and generate the preset exception keyword through the initial exception keyword. Specifically, before the step of "matching the text subwords with the preset abnormal keywords", the method may include:
acquiring an initial abnormal keyword;
performing expansion processing on the initial abnormal keywords to obtain expanded abnormal keywords;
performing abnormal link search based on the expanded abnormal keywords to obtain abnormal link search results;
and screening preset abnormal keywords from the expanded abnormal keywords based on the abnormal link search result.
The initial abnormal keywords may include abnormal keywords that have not been subjected to expansion processing.
For example, the initial abnormal keywords may include abnormal keywords generated by human experience. For example, the text content corresponding to the known abnormal link may be manually marked with an abnormal keyword, and the marked abnormal keyword is used as the initial abnormal keyword.
As another example, the initial exception keywords may include derived exception keywords obtained by data mining of known exception links.
In one embodiment, because the initial abnormal keywords are all derived from known abnormal links, they generalize poorly. If a new abnormal keyword appears in a newly discovered abnormal link, manual labeling or data mining has to be performed on that link again, which affects the reliability and efficiency of detecting abnormal links. Therefore, after the initial abnormal keywords are obtained, they can be expanded to obtain expanded abnormal keywords; an abnormal link search can be performed based on the expanded abnormal keywords to obtain abnormal link search results; and preset abnormal keywords can be screened from the expanded abnormal keywords based on the abnormal link search results, which improves the generalization of the preset abnormal keywords. When the preset abnormal keywords are then used to judge whether the search link is an abnormal link, their better generalization improves the accuracy of that judgment and therefore the accuracy of processing abnormal links.
In an embodiment, the initial abnormal keyword may be expanded in a plurality of ways to obtain an expanded abnormal keyword. For example, the synonyms and near-synonyms of the initial abnormal keyword may be looked up, and the words found may be used as expanded abnormal keywords. For another example, words whose expression is similar to the initial abnormal keyword may be added manually and used as expanded abnormal keywords, and so on.
In one embodiment, after the expanded abnormal keyword is obtained, it needs to be verified: the verification judges whether the expanded abnormal keyword can actually retrieve abnormal links, which improves the accuracy and reliability of the preset abnormal keywords. Therefore, an abnormal link search can be performed based on the expanded abnormal keywords to obtain an abnormal link search result. Then, the preset abnormal keywords are screened out from the expanded abnormal keywords based on the abnormal link search result.
For example, a search may be performed using the expanded abnormal keyword to obtain at least one keyword search link obtained based on the expanded abnormal keyword search. Then, whether the keyword search link is a dead link or not can be judged, so that an abnormal link search result of the expanded abnormal keyword is obtained.
The abnormal link search result can include the abnormal link searched by the expanded abnormal keyword and the number of the searched abnormal links.
Next, a preset abnormal keyword may be screened out from the expanded abnormal keywords based on the abnormal link search result. For example, it can first be judged from the abnormal link search result whether any abnormal link can be found with the expanded abnormal keyword. When abnormal links can be found, the number of abnormal links found with the expanded abnormal keyword can be determined. When the number of abnormal links is greater than or equal to a preset threshold, the expanded abnormal keyword can be used as a preset abnormal keyword. When the expanded abnormal keyword finds no abnormal link, or the number of abnormal links found is less than the preset threshold, the expanded abnormal keyword is not used as a preset abnormal keyword.
In an embodiment, the expanded abnormal keyword is used as the preset abnormal keyword, and the initial abnormal keyword is also used as the preset abnormal keyword. That is, when the text subwords are matched with the preset abnormal keywords, the preset abnormal keywords include the initial abnormal keywords and the screened expanded abnormal keywords.
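The screening step can be sketched as follows. The function `screen_keywords`, the `search_abnormal_links` callback, and the default threshold of 3 are all hypothetical; the application does not specify how the search is executed or what threshold value is used.

```python
def screen_keywords(initial, expanded, search_abnormal_links, min_links=3):
    """Build the preset abnormal keyword set.

    initial: initial abnormal keywords (always kept).
    expanded: candidate keywords produced by expansion.
    search_abnormal_links: hypothetical callback that returns the list of
        abnormal links found when searching with a keyword.
    min_links: assumed preset threshold on the number of abnormal links.
    """
    preset = set(initial)
    for keyword in expanded:
        abnormal_links = search_abnormal_links(keyword)
        # Keep the candidate only if it retrieves enough abnormal links.
        if len(abnormal_links) >= min_links:
            preset.add(keyword)
    return preset
```

With a lookup table standing in for the search backend, a candidate that retrieves three abnormal links is kept at `min_links=3`, while one that retrieves a single link is dropped.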
In an embodiment, after the preset abnormal keyword is obtained, the text sub-words can be matched with the preset abnormal keyword to obtain a keyword matching result, and an abnormal detection result is then generated based on the keyword matching result. That is, it may be detected whether the text sub-words of the text content include a preset abnormal keyword. When the text sub-words include a preset abnormal keyword, an abnormal detection result indicating that the search link is an abnormal link may be generated. When the text sub-words do not include a preset abnormal keyword, an abnormal detection result indicating that the search link is not an abnormal link may be generated.
The abnormal detection result is used for explaining whether the search link is an abnormal link or not.
In one embodiment, the anomaly detection result may have a variety of representations. For example, it is possible to indicate that the search link is an abnormal link when the abnormality detection result is "1", and indicate that the search link is not an abnormal link when the abnormality detection result is "0". For another example, it may be assumed that the search link is an abnormal link when the abnormality detection result is "True", and the search link is not an abnormal link when the abnormality detection result is "False", or the like.
In an embodiment, in order to improve the accuracy of detecting the abnormal link, when the keyword detection is performed on the text content to obtain that the search link is not the abnormal link, the semantic features of the text content may be further extracted, and whether the search link is the abnormal link or not may be determined based on the semantic features of the text content. Specifically, the step "generating an anomaly detection result based on the keyword matching result" may include:
when the text sub-words are not matched with preset abnormal keywords, performing semantic extraction on the text content to obtain semantic features of the text content;
forward coding and backward coding are respectively carried out on the semantic features to obtain forward coding information corresponding to the forward coding and backward coding information corresponding to the backward coding;
fusing the forward coding information and the backward coding information to obtain fused coding information;
and calculating the abnormal probability of the search link based on the fused coding information to obtain an abnormal detection result.
Where the semantic features may include information describing the meaning of the textual content expression.
In one embodiment, there are multiple ways to perform semantic extraction on the text content to obtain semantic features of the text content. For example, convolution operation may be performed on the text content to obtain semantic features of the text content. For another example, the text content may be sampled according to a preset step length, and semantic conversion may be performed on the text content obtained by sampling to obtain a semantic feature.
In an embodiment, in order to improve the detection accuracy, after the semantic features are obtained, forward coding and backward coding may be performed on the semantic features to obtain forward coding information corresponding to the forward coding and backward coding information corresponding to the backward coding.
The forward encoding may include encoding the semantic features from front to back. For example, if the semantic feature is "content is empty", forward encoding may refer to encoding in the order "content", "is", "empty", and backward encoding may refer to encoding in the reverse order "empty", "is", "content".
In an embodiment, after the forward coding information and the backward coding information are obtained, the forward coding information and the backward coding information may be fused to obtain fused coding information.
For example, the forward encoded information and the backward encoded information may be added to obtain fused encoded information. For another example, the forward encoded information and the backward encoded information may be concatenated to obtain the fused encoded information.
By carrying out forward coding and backward coding on semantic features, the actual semantics of the text content can be considered, and the reverse semantics of the text content can also be considered, so that the information content of the fused coding information is higher, and the accuracy of calculating the abnormal probability of searching links based on the fused coding information can be improved.
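The two fusion options mentioned above, addition and concatenation, can be sketched on plain Python lists. The vectors stand for the forward and backward coding information; how those vectors are produced is outside this sketch, and the function names are hypothetical.

```python
def fuse_by_addition(forward, backward):
    """Fuse by element-wise addition; both vectors must have equal length."""
    if len(forward) != len(backward):
        raise ValueError("forward and backward encodings differ in length")
    return [f + b for f, b in zip(forward, backward)]


def fuse_by_concatenation(forward, backward):
    """Fuse by concatenating the forward and backward encodings."""
    return list(forward) + list(backward)
```

Addition keeps the dimensionality unchanged, whereas concatenation doubles it and keeps the two directions separable for the downstream probability calculation.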
In an embodiment, after obtaining the fused encoding information, an anomaly probability of the search link may be calculated based on the fused encoding information, and an anomaly detection result may be obtained based on the anomaly probability.
For example, the abnormality probability may be compared with a preset abnormality probability discrimination threshold, and an abnormality detection result may be determined from the comparison result.
For example, the predetermined anomaly probability determination threshold may be 90%. When the probability of abnormality is greater than or equal to 90%, then the search link may be an abnormal link. And when the abnormal probability is less than 90%, the search link is not an abnormal link.
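A minimal sketch of turning fused coding information into an abnormal detection result follows. The application does not state how the probability is computed; a single logistic unit with hypothetical `weights` and `bias` is assumed here purely for illustration, together with the 90% threshold from the example above.

```python
import math


def abnormal_probability(fused, weights, bias=0.0):
    """Map fused coding information to a probability in (0, 1).

    Assumption: a single logistic unit; weights and bias are hypothetical
    parameters that would normally come from training.
    """
    score = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-score))


def is_abnormal_link(probability, threshold=0.9):
    """Compare the abnormal probability with the preset threshold."""
    return probability >= threshold
```

Because the comparison is "greater than or equal to", a probability exactly at the threshold is treated as abnormal, matching the rule stated above.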
For another example, semantic detection may be performed on the text content in another manner, and whether the search link is an abnormal link may be determined according to the detection result.
For example, a Long Short-Term Memory artificial neural network (LSTM) or a bidirectional Long Short-Term Memory artificial neural network (Bi-LSTM) may be used to perform semantic detection on the text content, and whether the search link is an abnormal link may be determined according to the detection result.
In one embodiment, when it is detected that the main content does not include text content, whether the search link is an abnormal link may be determined from other types of main content. For example, as shown in FIG. 4, when a search link is an abnormal link, the webpage content corresponding to the search link may indicate through an image that the link is abnormal. Therefore, whether the search link is an abnormal link can be judged from the image content. Specifically, the method provided by the embodiment of the present application further includes:
when detecting that the main content does not include the text content, performing image detection on the main content;
when detecting that the main content comprises image content, performing character recognition on the image content;
and when the character information of the image content is identified, carrying out abnormal character detection on the character information to obtain an abnormal detection result.
In an embodiment, when it is detected that the main content does not include text content, image detection may be performed on the main content to determine whether image content is included in the main content. There are many ways in which image detection may be performed on the main content.
For example, when the original webpage content is described in HTML, whether image content is included in the main content can be determined by detecting the HTML tags of the main content. For example, when the HTML tags of the main content include an image tag, this indicates that image content may be included in the main content. Image tags may include tags that act on images. For example, the image tag may be an <img> tag, and so on.
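Image detection through the <img> tag can be sketched with the standard `html.parser` module. The names `ImageTagFinder` and `find_image_sources` are hypothetical, and the sketch only records each image's `src` attribute; fetching the image and running character recognition on it are not shown.

```python
from html.parser import HTMLParser


class ImageTagFinder(HTMLParser):
    """Record the src attribute of every <img> tag encountered."""

    def __init__(self):
        super().__init__()
        self.image_sources = []

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <img ... /> are also routed here by
        # the default handle_startendtag implementation.
        if tag == "img":
            self.image_sources.append(dict(attrs).get("src"))


def find_image_sources(main_content_html):
    """Return the src values of the images in the main content."""
    finder = ImageTagFinder()
    finder.feed(main_content_html)
    finder.close()
    return finder.image_sources
```

A non-empty result indicates that the main content includes image content, so character recognition can then be applied to those images.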
In an embodiment, when it is detected that the subject content includes image content, character recognition may be performed on the image content. The character recognition of the image content may include an operation of determining whether the image content includes character information. The character information may include characters, punctuation marks, numbers and other symbols.
By performing character recognition on the image content, it can be determined whether the image content has character information. When the image content has character information, abnormal character detection can be carried out on the character information to obtain an abnormal detection result.
In one embodiment, there are a number of ways in which character recognition may be performed on image content. For example, the image content may be subjected to Character Recognition using Optical Character Recognition (OCR), so as to determine whether the image content has Character information. For another example, the image content may be character recognized using a deep learning model. For example, the character recognition can be performed on the image content by using a Deep learning model such as a Convolutional Neural Network (CNN) or a Deep Convolutional Neural network (Deep CNN).
In an embodiment, when it is recognized that the image content includes character information, abnormal character detection may be performed on the character information, resulting in an abnormal detection result.
The character information may be detected in an abnormal keyword detection manner with reference to the text content. For example, the character information may be subjected to word segmentation processing, the segmented character information may be matched with a preset abnormal keyword, and an abnormal detection result may be generated according to the matching result. When the character information is detected and the search link is not an abnormal link, semantic recognition can be performed on the character information, so that whether the search link is an abnormal link or not is detected again.
In an embodiment, when the main content of the original web content does not include the text content and the image content, the original web content may be detected in another way, so as to determine whether the search link is an abnormal link. Specifically, the method for processing an exception link provided in the embodiment of the present application may further include:
when detecting that the main content does not comprise the image content, clustering the original webpage content to obtain a target cluster corresponding to the original webpage content;
calculating the similarity between objects in the target cluster;
and judging whether the search link is an abnormal link or not based on the similarity.
In an embodiment, when it is detected that the main content does not include the image content and the text content, the original web content may be clustered as a whole to obtain a target cluster corresponding to the original web content.
The process in which a collection of physical or abstract objects is divided into classes composed of similar objects is called clustering. The cluster generated by clustering is a collection of a set of data objects that are similar to objects in the same cluster and distinct from objects in other clusters.
The original webpage content can be clustered in various ways to obtain a target cluster corresponding to the original webpage content. For example, the K-Means clustering algorithm (K-MEANS) or Clustering Large Applications based on Randomized Search (CLARANS) can be used to cluster the original webpage content to obtain the corresponding target cluster. For another example, the original webpage content may be clustered by using Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), divisive hierarchical clustering (DHC), or agglomerative hierarchical clustering (AHC) to obtain the corresponding target cluster.
In one embodiment, when the search link is an abnormal link, the objects in the target cluster corresponding to the original web page content may have a characteristic of high similarity. Therefore, after the target cluster corresponding to the original webpage content is determined, the similarity between the objects in the target cluster can be calculated, and whether the search link is an abnormal link or not can be judged based on the similarity.
In the embodiment of the present application, the object in the target cluster may include original web page contents of different search links clustered together.
In one embodiment, the similarity between objects in the target cluster may be calculated by a euclidean distance or a pearson correlation coefficient. When the similarity between the objects in the target cluster is greater than or equal to a preset similarity threshold, the search link can be judged to be an abnormal link. When the similarity between the objects in the target cluster is smaller than a preset similarity threshold, it can be judged that the search link is not an abnormal link.
For example, the preset similarity threshold may be set to 60%, and when the similarity between the objects in the target cluster is greater than or equal to 60%, it may be determined that the search link is an abnormal link.
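The similarity check can be sketched as below, using the Pearson correlation coefficient and the 60% threshold from the example. It assumes each object in the target cluster has already been turned into a numeric feature vector, and that the cluster-level similarity is the mean of all pairwise correlations; both are assumptions of this sketch, not statements from the application.

```python
import math
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant vector has no defined correlation
    return cov / (sx * sy)


def cluster_similarity(vectors):
    """Mean pairwise Pearson correlation of the objects in a cluster."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    return mean(pearson(a, b) for a, b in pairs)


def is_abnormal_by_similarity(vectors, threshold=0.6):
    """Judge the search link abnormal when the cluster is similar enough."""
    return cluster_similarity(vectors) >= threshold
```

Pages whose feature vectors rise and fall together yield a similarity close to 1 and are judged abnormal, while pages with opposing patterns fall below the threshold.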
In an embodiment, when the description information is the link content, an abnormal link detection strategy corresponding to the link content may be adopted to detect the link content, so as to determine whether the search link is an abnormal link according to the abnormal detection result. Specifically, the step of performing anomaly detection on the search link by using a corresponding abnormal link detection strategy according to the description information of each content dimension to obtain an anomaly detection result of each content dimension may include:
analyzing the link content to obtain domain names of the link content on different levels;
clustering the link content based on domain names of the link content on different levels to obtain a target domain name cluster corresponding to the link content;
and carrying out similarity discrimination on the target domain name cluster, and generating an abnormal detection result based on the discrimination result.
The domain name may include a name of a certain device or group of devices on the internet, which is composed of a string of names separated by dots, and is used for indicating the location of the computer device during data transmission. By the domain name, the domain name and the IP (Internet protocol) address can be mapped with each other, so that people can more conveniently visit the Internet.
In one embodiment, domain names may be ranked differently according to their location ranges. For example, the domain names may be divided into a root domain name, a primary domain name, and a secondary domain name, among others.
The root domain name is the highest-level domain name node, and link content generally carries the root domain name. For example, the root domain name may be represented as ".root", and so on.
Wherein the domain name at the next level of the root domain name is a first level domain name. For example, the primary domain name may include ". com", ". org", ". net", ". cn", and so on.
Wherein the domain name at the next level of the first-level domain name is a second-level domain name. For example, the second-level domain name may include "web1.web.cn", "web2.web1.web.cn", and so on.
In one embodiment, the link content may consist of domain names. For example, the link content may take the form "host name.secondary domain name.primary domain name.root domain name". For example, when the link content is "www.web1.com.root", ".web1" may be the secondary domain name, ".com" may be the primary domain name, and ".root" may be the root domain name. Therefore, the link content can be analyzed to obtain domain names of the link content on different levels. It is then determined whether the search link is an abnormal link based on the domain names at the different levels.
In an embodiment, domain names of the link content on different levels can be obtained by traversing the link content and then matching the traversed link content with domain name identification information of a preset level.
For example, the link content is "www.web1.com.root". The link content is traversed to obtain ".www", ".web1", ".com", and ".root". Then, the traversed link content may be matched with the preset level domain name identification information, so as to obtain the domain names on different levels.
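The traversal and matching can be sketched as follows, following the "host name.secondary.primary.root" layout used in the example above. The sets `PRESET_ROOT_LABELS` and `PRESET_PRIMARY_LABELS` are hypothetical stand-ins for the preset level domain name identification information.

```python
# Hypothetical preset level identification information.
PRESET_ROOT_LABELS = {"root"}
PRESET_PRIMARY_LABELS = {"com", "org", "net", "cn"}


def parse_domain_levels(link_content):
    """Split link content into domain names on different levels.

    Labels are read from right to left: root, primary, secondary, and
    whatever remains is treated as the host name.
    """
    labels = [label for label in link_content.strip(".").split(".") if label]
    levels = {}
    if labels and labels[-1] in PRESET_ROOT_LABELS:
        levels["root"] = "." + labels.pop()
    if labels and labels[-1] in PRESET_PRIMARY_LABELS:
        levels["primary"] = "." + labels.pop()
    if labels:
        levels["secondary"] = "." + labels.pop()
    if labels:
        levels["host"] = ".".join(labels)
    return levels
```

For the example link content "www.web1.com.root" this yields ".root", ".com", and ".web1" as the root, primary, and secondary domain names, with "www" left as the host name.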
In an embodiment, after the domain names of the link content at different levels are obtained, clustering processing may be performed on the link content based on the domain names of the link content at different levels to obtain a target domain name cluster corresponding to the link content.
There are various methods for clustering domain names at different levels.
For example, the domain names at different levels may be clustered by using algorithms such as K-MEANS, CLARANS, or BIRCH, to obtain a target domain name cluster corresponding to the link content.
For another example, the distance between the domain name and a preset domain name in each of a plurality of preset domain name clusters may be calculated, and then the target domain name cluster of the domain name may be determined from the plurality of preset domain name clusters based on the distance. Specifically, the step of performing clustering processing on the link content based on domain names of the link content at different levels to obtain a target domain name cluster corresponding to the link content may include:
calculating the distance between the domain name and a preset domain name in a plurality of preset domain name clusters;
determining a target domain name cluster of the domain name from the plurality of preset domain name clusters based on the distance.
There are various ways to calculate the distance between the domain name and the preset domain name in the preset domain name clusters. For example, the distance between the domain name and the preset domain name in the preset domain name clusters can be calculated by using a euclidean distance or a pearson correlation coefficient.
Then, a target domain name cluster of the domain name may be determined from the plurality of preset domain name clusters based on the distance. For example, the preset domain name cluster with the smallest distance may be selected as the target domain name cluster.
For example, when the link content includes a primary domain name and a secondary domain name, the distance between the primary domain name of the link content and the preset domain name in each of the plurality of preset domain name clusters may be calculated. In addition, the distance between the secondary domain name of the link content and the preset domain name in each of the plurality of preset domain name clusters can be calculated. Then, the two distances for each preset domain name cluster are added, and the target domain name cluster is determined from the plurality of preset domain name clusters based on the summed distances.
For example, the link content includes a primary domain name a1 and a secondary domain name a2. The preset domain name clusters include a preset domain name cluster b1, a preset domain name cluster b2, and a preset domain name cluster b3. Then, the distance c11 between a1 and the preset domain name in b1, the distance c12 between a1 and the preset domain name in b2, and the distance c13 between a1 and the preset domain name in b3 can be calculated. Similarly, the distance c21 between a2 and the preset domain name in b1, the distance c22 between a2 and the preset domain name in b2, and the distance c23 between a2 and the preset domain name in b3 can be calculated. Then, c11 and c21 may be added to give c1; c12 and c22 are added to give c2; and c13 and c23 are added to give c3. The target domain name cluster may then be determined from the plurality of preset domain name clusters based on c1, c2, and c3. For example, when c1 is the smallest, the preset domain name cluster b1 may be determined as the target domain name cluster.
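The summed-distance selection in the a1/a2 and b1/b2/b3 example can be sketched as below. The application names Euclidean distance and the Pearson correlation coefficient but does not say how domain names are turned into vectors, so this sketch swaps in a string distance based on `difflib.SequenceMatcher` as a default stand-in. It also assumes each preset cluster is represented by a single preset domain name; the function names are hypothetical.

```python
from difflib import SequenceMatcher


def string_distance(a, b):
    """Stand-in distance: 1 minus the SequenceMatcher similarity ratio."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def pick_target_cluster(domain_levels, preset_clusters, distance=string_distance):
    """Choose the preset cluster with the smallest summed distance.

    domain_levels: domain names of the link content, e.g. [a1, a2].
    preset_clusters: mapping of cluster name to its preset domain name.
    Returns (target cluster name, summed distance per cluster).
    """
    totals = {
        name: sum(distance(level, preset) for level in domain_levels)
        for name, preset in preset_clusters.items()
    }
    target = min(totals, key=totals.get)
    return target, totals
```

Passing a custom `distance` callable lets the same selection logic be reused with a Euclidean or correlation-based distance once the domain names have a vector representation.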
In an embodiment, when the description information includes a link status code, the link status code may be detected by using an abnormal link detection strategy corresponding to the link status code, so as to determine whether the search link is an abnormal link according to the abnormal detection result. Specifically, the step of performing anomaly detection on the search link by using a corresponding abnormal link detection strategy according to the description information of each content dimension to obtain an anomaly detection result of each content dimension may include:
matching the link state code with a preset abnormal state code to obtain a state code matching result;
and mapping the matching result to the corresponding abnormal detection result.
The preset abnormal state code may include state codes 404, 403, 405, etc. indicating that the web page content is abnormal.
In an embodiment, the link status code may be matched with the preset abnormal status code to determine whether the link status code is a preset abnormal status code. When the link status code is a preset abnormal status code, the search link is an abnormal link. When the link status code is not a preset abnormal status code, the search link is not an abnormal link.
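Status code matching reduces to a set membership test, sketched below with the three codes named above. The names `PRESET_ABNORMAL_STATUS_CODES` and `detect_by_status_code` are hypothetical, and a deployed list would likely contain more codes.

```python
# Preset abnormal status codes taken from the examples in the description.
PRESET_ABNORMAL_STATUS_CODES = {403, 404, 405}


def detect_by_status_code(status_code):
    """Map the status code matching result to an abnormal detection result.

    Returns True when the search link is judged to be an abnormal link.
    """
    return status_code in PRESET_ABNORMAL_STATUS_CODES
```

A link that returns 404 is judged abnormal, while a link that returns 200 is not.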
105. Performing blocking processing on the search link based on the abnormal detection result.
In an embodiment, after obtaining the anomaly detection result for each content dimension, the search link may be blocked based on the anomaly detection result. For example, when it is detected that the search link is an abnormal link, the search link that is an abnormal link may be subjected to the blocking process.
The blocking processing may include removing the abnormal link. By performing blocking processing on the search link, the user does not receive the abnormal link when the computer device displays the search results, which improves the user's search experience.
In an embodiment, when detecting that a search link is an abnormal link, the abnormal link processing apparatus may add an abnormal identifier to that search link. Then, when search links are displayed to the user, the computer device may block the search links marked as abnormal according to the abnormal identifier, so that the search links displayed to the user do not include abnormal links.
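A minimal sketch of marking a link with an abnormal identifier and filtering it out at display time is given here. The `SearchLink` class, its `abnormal` field, and both function names are hypothetical; the application does not specify how the identifier is stored.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchLink:
    url: str
    abnormal: bool = False  # hypothetical abnormal identifier


def mark_abnormal(link: SearchLink) -> None:
    """Attach the abnormal identifier to a link judged to be abnormal."""
    link.abnormal = True


def links_to_display(links: List[SearchLink]) -> List[SearchLink]:
    """Block marked links so that they are not shown to the user."""
    return [link for link in links if not link.abnormal]


links = [SearchLink("https://example.com/a"), SearchLink("https://example.com/b")]
mark_abnormal(links[1])
print([link.url for link in links_to_display(links)])
```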
The embodiment of the application provides an abnormal link processing method, which comprises the following steps: acquiring a search link to be detected; searching content based on the search link to obtain a webpage structure corresponding to the search link; analyzing the webpage structure to obtain the description information of the search link on at least one content dimension; aiming at the description information of each content dimension, carrying out anomaly detection on the search links by adopting a corresponding anomaly link detection strategy to obtain an anomaly detection result of each content dimension; and carrying out blocking processing on the search link based on an abnormal detection result. According to the method and the device, the search link is subjected to abnormal detection based on the description information on different content dimensions, so that whether the search link is an abnormal link or not can be judged on multiple content dimensions. Whether the search links are abnormal links or not is judged on a plurality of content dimensions, so that the detection range of the search links can be expanded, the search links which are abnormal links can be prevented from being omitted when the abnormal links are detected, and the accuracy of detecting the search links is improved.
When detecting the text content, keyword detection may first be performed on the text content to judge whether the search link is an abnormal link. When keyword detection does not identify the search link as an abnormal link, semantic detection may then be performed on the text content to judge whether the search link is an abnormal link. Performing both keyword detection and semantic detection expands the detection coverage of the text content, which improves the accuracy of text detection and therefore the accuracy of judging whether the search link is an abnormal link.
In addition, for text content that has no semantics or is garbled, text clustering may be used to screen out text clusters with a high occurrence frequency, and the text clusters may be used to identify anti-crawling behavior, false positives, and missed detections. For original webpage content that cannot be recognized, a content clustering method may be used to judge abnormal links, which effectively handles cases such as website anti-crawling and abnormal webpage structures.
For example, for a website with an anti-crawling function, the retrieved webpage structure is often abnormal: it may contain many duplicate contents, or it may not match the content of its link. In this case, the embodiment of the application can determine whether the search link corresponding to the website is a truly abnormal link or only appears abnormal because of the anti-crawling mechanism, thereby improving the accuracy of detecting abnormal links.
The method described in the above examples is further illustrated in detail below by way of example.
The method of the embodiment of the present application will be described by taking an example in which the abnormal link processing method is integrated on a computer device.
In an embodiment, as shown in fig. 5, an abnormal link processing method includes the following specific steps:
201. The computer device acquires the search link to be detected.
For example, the computer device obtains a plurality of search links to be detected.
202. The computer device performs a content search based on the search link to obtain a webpage structure corresponding to the search link.
For example, the computer device may perform a content search on each search link, resulting in a web page structure corresponding to each search link.
203. The computer device parses the webpage structure to obtain the description information of the search link on at least one content dimension.
For example, the computer device may parse the web page structure of each search link to obtain descriptive information for each search link in at least one content dimension.
The description information on at least one content dimension may include original web page content, link content and a link status code.
Different search links may have description information in different content dimensions.
For example, the web page structure of some search links may be parsed to obtain link content and link status codes. As another example, the web page structure of other search links may be parsed to obtain link content and original web page content. As yet another example, the web page structure of still other search links may be parsed to obtain link content, link status codes, and original web page content.
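One way to represent description information whose content dimensions differ from link to link is a record with optional fields, sketched here. The class name `DescriptionInfo`, its field names, and the method `available_dimensions` are assumptions for illustration and do not appear in the application.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class DescriptionInfo:
    # Each field is one content dimension; None means the dimension
    # could not be parsed from the web page structure.
    link_content: Optional[str] = None
    link_status_code: Optional[int] = None
    original_web_content: Optional[str] = None

    def available_dimensions(self) -> List[str]:
        """List the content dimensions present for this search link."""
        return [name for name, value in vars(self).items() if value is not None]


info = DescriptionInfo(link_content="https://example.com/a", link_status_code=404)
print(info.available_dimensions())
```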
204. For the description information of each content dimension, the computer device performs anomaly detection on the search link using a corresponding abnormal link detection strategy to obtain an anomaly detection result of each content dimension.
For example, as shown in fig. 6, when the description information of the search link includes link content, a link status code and original web content, the link content may correspond to one abnormal link detection policy, the link status code may correspond to another abnormal link detection policy, and the original web content may correspond to a third abnormal link detection policy.
The link status code can be matched with a preset abnormal status code. And when the link state code is matched with the preset abnormal state code, the search link corresponding to the link state code is an abnormal link. And when the link state code is not matched with the preset abnormal state code, the search link corresponding to the link state code is not the abnormal link.
The link content can be parsed to obtain its domain names at different levels. The domain names at the different levels are then clustered, and whether the search link is an abnormal link is judged according to the clustering result.
For example, the first level domain names of the linked content may be clustered, and the second level domain names of the linked content may be clustered. Then, whether the search link is an abnormal link can be judged according to the clustering result of the first-level domain name and the clustering result of the second-level domain name.
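A sketch of extracting domain names at different levels from the link content is given here. It assumes that levels are counted from the right of the host name (level 1 is the last label, level 2 the last two labels, and so on) and that the link includes a scheme such as `https://`. Grouping links by an identical domain name at a chosen level is used as a simple stand-in for the clustering step, whose algorithm the application does not specify. The function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlsplit


def domain_levels(url: str) -> List[str]:
    """Return the domain names of a link, from level 1 up to the full host."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".") if host else []
    return [".".join(labels[-n:]) for n in range(1, len(labels) + 1)]


def group_by_level(urls: List[str], level: int) -> Dict[str, List[str]]:
    """Group links that share the same domain name at the given level."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        levels = domain_levels(url)
        if len(levels) >= level:
            groups[levels[level - 1]].append(url)
    return dict(groups)


urls = ["https://a.example.com/x", "https://b.example.com/y", "https://c.other.org/z"]
print(domain_levels(urls[0]))
print(group_by_level(urls, 2))
```

The resulting groups can then be inspected per level, for example to see whether many links under the same second-level domain were judged abnormal.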
For the original web page content, it may be determined whether the content of the original web page content is empty or not. When the original webpage content is not empty content, the original webpage content can be subjected to text analysis, so that whether the original webpage content has text content or not is judged.
When the original webpage content has text content, abnormal keyword detection may be performed on the text content to judge whether the search link corresponding to the text content is an abnormal link. When abnormal keyword detection does not identify the search link as an abnormal link, the text content may be converted into a vector representation using the word2vec method, the vector representation is then input into a BiLSTM model to classify the text content, and whether the search link is an abnormal link is determined according to the classification result. When the search link is determined to be an abnormal link through the text content, the search link may be blocked. Using the BiLSTM model to classify the text content better captures long-distance dependencies and bidirectional semantic dependencies.
In one embodiment, when the original web page content does not have text content, it can be determined whether the original web page content has image content. When the original webpage content has image content, OCR analysis can be performed on the image content to obtain text characters corresponding to the image content.
Then, abnormal keyword detection may be performed on the text characters, and the text characters may be classified using the BiLSTM model, to judge whether the search link is an abnormal link.
When the original webpage content does not have text content and image content, the original webpage content can be regarded as a whole, and then the original webpage content is clustered to obtain a clustering result. Then, it may be determined whether the search link is an abnormal link based on the clustering result.
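The order of checks on the original webpage content (text first, then image content via character recognition, then clustering) may be sketched as follows. The detectors are passed in as hypothetical callables so that the sketch stays self-contained: `is_abnormal_text` is assumed to wrap both keyword detection and the BiLSTM judgment, `recognize_characters` stands for OCR, and `is_abnormal_cluster` stands for the clustering-based judgment. Falling back to clustering when OCR returns no characters is an assumption; the application does not describe that case.

```python
from typing import Callable, Optional


def detect_original_content(
    text: Optional[str],
    image: Optional[bytes],
    is_abnormal_text: Callable[[str], bool],
    recognize_characters: Callable[[bytes], str],
    is_abnormal_cluster: Callable[[], bool],
) -> bool:
    """Return True when the search link is judged to be an abnormal link."""
    if text:
        # Text content is present: keyword and semantic checks apply.
        return is_abnormal_text(text)
    if image:
        # No text content: recognize characters in the image and check them.
        recognized = recognize_characters(image)
        if recognized:
            return is_abnormal_text(recognized)
    # Neither usable text nor image content: treat the page as a whole.
    return is_abnormal_cluster()
```

For instance, calling `detect_original_content(None, b"...", check, ocr, cluster)` follows the image branch and only reaches `cluster()` if `ocr` returns an empty string.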
205. The computer device blocks the search link based on the anomaly detection result.
In the embodiment of the application, computer equipment acquires a search link to be detected; searching content based on the search link to obtain a webpage structure corresponding to the search link; the computer equipment analyzes the webpage structure to obtain the description information of the search link on at least one content dimension; aiming at the description information of each content dimension, the computer equipment adopts a corresponding abnormal link detection strategy to carry out abnormal detection on the search link so as to obtain an abnormal detection result of each content dimension; and carrying out blocking processing on the search link based on the abnormal detection result. According to the method and the device, the search link is subjected to abnormal detection based on the description information on different content dimensions, so that whether the search link is an abnormal link or not can be judged on multiple content dimensions. Whether the search links are abnormal links or not is judged on a plurality of content dimensions, so that the detection range of the search links can be expanded, the search links which are abnormal links can be prevented from being omitted when the abnormal links are detected, and the accuracy of detecting the search links is improved.
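The per-dimension dispatch in steps 204 and 205 may be sketched as a mapping from content dimension to detection strategy. The dimension names, the example strategies, and the rule that a link is blocked when any dimension reports it as abnormal are all assumptions for this sketch; the application does not state how the per-dimension results are combined.

```python
from typing import Callable, Dict

DetectionStrategy = Callable[[object], bool]


def detect_per_dimension(
    description: Dict[str, object], strategies: Dict[str, DetectionStrategy]
) -> Dict[str, bool]:
    """Apply the strategy registered for each available content dimension."""
    results: Dict[str, bool] = {}
    for dimension, value in description.items():
        strategy = strategies.get(dimension)
        if strategy is not None:
            results[dimension] = strategy(value)
    return results


def should_block(results: Dict[str, bool]) -> bool:
    """Assumed combination rule: block when any dimension is abnormal."""
    return any(results.values())


strategies = {
    "link_status_code": lambda code: code in {403, 404, 405},
    "original_web_content": lambda text: "deleted" in text,
}
results = detect_per_dimension({"link_status_code": 200, "original_web_content": "page deleted"}, strategies)
print(results, should_block(results))
```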
In order to better implement the abnormal link processing method provided by the embodiment of the present application, in an embodiment, an abnormal link processing apparatus is further provided, and the abnormal link processing apparatus may be integrated in a computer device. The terms used have the same meanings as in the above abnormal link processing method, and for specific implementation details, refer to the description in the method embodiment.
In an embodiment, an abnormal link processing apparatus is provided, which may be specifically integrated in a computer device. As shown in fig. 7, the abnormal link processing apparatus includes an obtaining unit 301, a content searching unit 302, an analyzing unit 303, an anomaly detection unit 304, and a blocking unit 305:
an obtaining unit 301, configured to obtain a search link to be detected;
a content searching unit 302, configured to perform content search based on the search links to obtain a web page structure corresponding to each search link;
an analyzing unit 303, configured to analyze the web page structure to obtain description information of the search link in at least one content dimension;
an anomaly detection unit 304, configured to perform anomaly detection on the search link by using a corresponding anomaly link detection policy for the description information of each content dimension, to obtain an anomaly detection result of each content dimension;
a blocking unit 305, configured to block the search link based on the anomaly detection result.
In an embodiment, the anomaly detection unit 304 includes:
a filtering content detection subunit, configured to perform filtering content detection on the original web page content;
the first analysis subunit is used for analyzing the original webpage content to obtain main content in the original webpage content when the original webpage content does not include preset filtering content;
and the content detection subunit is used for carrying out content detection on the main content to obtain the abnormal detection result.
In one embodiment, the content detection subunit includes:
the text detection module is used for performing text detection on the main content;
and the abnormal keyword detection module is used for detecting abnormal keywords of the text content to obtain an abnormal detection result when the main content is detected to comprise the text content.
In one embodiment, the abnormal keyword detection module includes:
the word segmentation sub-module is used for carrying out word segmentation processing on the text content to obtain at least one text sub-word;
the keyword matching sub-module is used for matching the text sub-words with preset abnormal keywords to obtain keyword matching results;
and the result generation submodule is used for generating the abnormal detection result based on the keyword matching result.
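The three submodules above may be illustrated with a short sketch. The regular-expression tokenizer is a stand-in for word segmentation and only suits space-delimited text; Chinese text would need a dedicated segmenter. The keyword list and all function names are hypothetical.

```python
import re
from typing import FrozenSet, List

# Hypothetical preset abnormal keywords, for illustration only.
PRESET_ABNORMAL_KEYWORDS: FrozenSet[str] = frozenset({"deleted", "expired", "unavailable"})


def segment(text: str) -> List[str]:
    """Word segmentation stand-in: split the text content into sub-words."""
    return re.findall(r"\w+", text.lower())


def match_keywords(sub_words: List[str]) -> List[str]:
    """Match text sub-words against the preset abnormal keywords."""
    return sorted(set(sub_words) & PRESET_ABNORMAL_KEYWORDS)


def keyword_detection_result(text: str) -> bool:
    """Generate the detection result: True when any abnormal keyword matched."""
    return bool(match_keywords(segment(text)))


print(keyword_detection_result("This page has been deleted."))
```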
In one embodiment, the result generation submodule is configured to:
when the text sub-words are not matched with the preset abnormal keywords, performing semantic extraction on the text content to obtain semantic features of the text content;
respectively carrying out forward coding and backward coding on the semantic features to obtain forward coding information corresponding to the forward coding and backward coding information corresponding to the backward coding;
fusing the forward coding information and the backward coding information to obtain fused coding information;
and calculating the abnormal probability of the search link based on the fused coding information to obtain the abnormal detection result.
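The forward coding, backward coding, fusion, and probability calculation can be illustrated with a toy bidirectional recurrent encoder. This sketch deliberately substitutes a simple element-wise tanh recurrent cell with fixed scalar weights for the LSTM cells of a BiLSTM model, and uses concatenation as the fusion step; the weights `w_in`, `w_rec`, the classifier weights, and the function names are assumptions, and no trained parameters are implied.

```python
import math
from typing import List, Sequence


def rnn_encode(features: Sequence[Sequence[float]], w_in: float, w_rec: float) -> List[float]:
    """Run a simple tanh recurrent cell over the sequence; return the final state."""
    if not features:
        raise ValueError("features must contain at least one step")
    hidden = [0.0] * len(features[0])
    for step in features:
        hidden = [math.tanh(w_in * x + w_rec * h) for x, h in zip(step, hidden)]
    return hidden


def abnormal_probability(
    features: Sequence[Sequence[float]], weights: Sequence[float], bias: float = 0.0
) -> float:
    """Encode in both directions, fuse by concatenation, and apply a sigmoid."""
    forward = rnn_encode(features, w_in=0.8, w_rec=0.5)
    backward = rnn_encode(list(reversed(features)), w_in=0.8, w_rec=0.5)
    fused = forward + backward  # list concatenation: forward state then backward state
    score = sum(w * v for w, v in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-score))


semantic_features = [[0.1, 0.2], [0.3, -0.4]]  # e.g. one vector per sub-word
print(abnormal_probability(semantic_features, weights=[1.0, 1.0, 1.0, 1.0]))
```

The length of `weights` is twice the feature dimension because the fused vector holds both the forward and the backward state.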
In an embodiment, the abnormal keyword detection module further includes:
the keyword acquisition submodule is used for acquiring initial abnormal keywords;
the expansion submodule is used for performing expansion processing on the initial abnormal keywords to obtain expanded abnormal keywords;
the abnormal link searching submodule is used for performing abnormal link searching based on the expanded abnormal keywords to obtain an abnormal link searching result;
and the screening submodule is used for screening the preset abnormal key words from the expanded abnormal key words based on the abnormal link search result.
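The construction of the preset abnormal keywords may be sketched as follows. The application does not specify how keywords are expanded or screened, so this sketch assumes a hypothetical table of related words for expansion and a hypothetical callable `count_abnormal_hits` that reports how many known abnormal links a keyword retrieves; keywords that retrieve too few are discarded.

```python
from typing import Callable, Dict, Iterable, List, Set


def expand_keywords(initial: Iterable[str], related: Dict[str, List[str]]) -> Set[str]:
    """Expand the initial abnormal keywords with their related words."""
    expanded: Set[str] = set()
    for keyword in initial:
        expanded.add(keyword)
        expanded.update(related.get(keyword, []))
    return expanded


def screen_keywords(
    expanded: Set[str], count_abnormal_hits: Callable[[str], int], min_hits: int = 1
) -> Set[str]:
    """Keep expanded keywords whose search returns enough abnormal links."""
    return {kw for kw in expanded if count_abnormal_hits(kw) >= min_hits}


related = {"deleted": ["removed", "gone"]}       # hypothetical expansion table
hits = {"deleted": 5, "removed": 3, "gone": 0}   # hypothetical search results
expanded = expand_keywords(["deleted"], related)
print(sorted(screen_keywords(expanded, lambda kw: hits.get(kw, 0))))
```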
In an embodiment, the content detection subunit further includes:
the image detection module is used for carrying out image detection on the main content when the main content does not comprise text content;
the character recognition module is used for carrying out character recognition on the image content when the main body content is detected to comprise the image content;
and the abnormal character detection module is used for detecting abnormal characters of the character information to obtain an abnormal detection result when the character information of the image content is identified.
In an embodiment, the content detection subunit further includes:
the clustering module is used for clustering the original webpage content to obtain a target cluster corresponding to the original webpage content when the main content is detected not to include the image content;
the first calculation module is used for calculating the similarity between the objects in the target cluster;
and the judging module is used for judging whether the search link is an abnormal link or not based on the similarity.
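The similarity calculation within a target cluster may be sketched with the standard-library `difflib.SequenceMatcher`. The similarity measure, the 0.9 threshold, and the rule that highly similar pages in a cluster indicate an abnormal link (for example identical anti-crawling or error pages) are assumptions for illustration; the application does not fix any of them.

```python
from difflib import SequenceMatcher
from itertools import combinations
from typing import List


def mean_pairwise_similarity(pages: List[str]) -> float:
    """Average similarity ratio over all pairs of objects in the cluster."""
    pairs = list(combinations(pages, 2))
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def is_abnormal_cluster(pages: List[str], threshold: float = 0.9) -> bool:
    """Assumed rule: near-identical pages in one cluster suggest abnormal links."""
    return mean_pairwise_similarity(pages) >= threshold


print(is_abnormal_cluster(["access denied", "access denied", "access denied"]))
```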
In one embodiment, the abnormality detection unit includes:
the second analysis subunit is used for analyzing the link content to obtain domain names of the link content on different levels;
the clustering processing subunit is configured to perform clustering processing on the link content based on domain names of the link content at different levels to obtain a target domain name cluster corresponding to the link content;
and the similarity judging subunit is used for judging the similarity of the target domain name cluster and generating the abnormal detection result based on the judgment result.
In an embodiment, the cluster processing subunit includes:
the second calculation module is used for calculating the distance between the domain name and a preset domain name in a plurality of preset domain name clusters;
and the determining module is used for determining, based on the distance, a target domain name cluster for the domain name from the plurality of preset domain name clusters.
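Assigning a domain name to the nearest preset domain name cluster may be sketched as follows. The distance used here (one minus the `difflib.SequenceMatcher` ratio) and the use of the closest preset domain name in each cluster are assumptions; the application does not name a distance measure. Every preset cluster is assumed to contain at least one domain name.

```python
from difflib import SequenceMatcher
from typing import Dict, List


def domain_distance(a: str, b: str) -> float:
    """Assumed string distance between two domain names, in [0, 1]."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()


def assign_cluster(domain: str, preset_clusters: Dict[str, List[str]]) -> str:
    """Return the name of the preset cluster with the closest preset domain."""
    return min(
        preset_clusters,
        key=lambda name: min(domain_distance(domain, p) for p in preset_clusters[name]),
    )


clusters = {"shop": ["shop.example.com"], "news": ["news.example.org"]}
print(assign_cluster("shop2.example.com", clusters))
```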
In one embodiment, the abnormality detection unit includes:
the state code matching subunit is used for matching the link state code with a preset abnormal state code to obtain a state code matching result;
and the mapping subunit is used for mapping the matching result to the corresponding abnormal detection result.
In a specific implementation, the above units may be implemented as independent entities, or may be combined arbitrarily to be implemented as the same or several entities, and the specific implementation of the above units may refer to the foregoing method embodiments, which are not described herein again.
The abnormal link processing device can improve the reliability of detecting the abnormal link.
The embodiment of the present application further provides a computer device, which may include a terminal or a server. For example, the computer device may be an abnormal link processing terminal, such as a mobile phone or a tablet computer; as another example, the computer device may be a server, such as an abnormal link processing server. Fig. 8 shows a schematic structural diagram of a terminal according to an embodiment of the present application. Specifically:
the computer device may include components such as a processor 401 of one or more processing cores, memory 402 of one or more computer-readable storage media, a power supply 403, and an input unit 404. Those skilled in the art will appreciate that the computer device configuration illustrated in FIG. 8 does not constitute a limitation of computer devices, and may include more or fewer components than those illustrated, or some components may be combined, or a different arrangement of components. Wherein:
The processor 401 is the control center of the computer device. It connects the various parts of the entire computer device using various interfaces and lines, and performs the various functions of the computer device and processes data by running or executing software programs and/or modules stored in the memory 402 and calling data stored in the memory 402, thereby monitoring the computer device as a whole. Optionally, the processor 401 may include one or more processing cores. Preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor may alternatively not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data created according to use of the computer device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 access to the memory 402.
The computer device further comprises a power supply 403 for supplying power to the various components, and preferably, the power supply 403 is logically connected to the processor 401 via a power management system, so that functions of managing charging, discharging, and power consumption are implemented via the power management system. The power supply 403 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
The computer device may also include an input unit 404, the input unit 404 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device may further include a display unit and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 401 in the computer device loads the executable file corresponding to the process of one or more application programs into the memory 402 according to the following instructions, and the processor 401 runs the application programs stored in the memory 402, thereby implementing various functions as follows:
acquiring a search link to be detected;
searching content based on the search links to obtain a webpage structure corresponding to each search link;
analyzing the webpage structure to obtain the description information of the search link on at least one content dimension;
aiming at the description information of each content dimension, carrying out anomaly detection on the search link by adopting a corresponding anomaly link detection strategy to obtain an anomaly detection result of each content dimension;
and carrying out blocking processing on the search link based on the abnormal detection result.
For the specific implementation of the above operations, refer to the foregoing embodiments; details are not repeated here.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the method provided in the various alternative implementations of the above embodiments.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by a computer program, which may be stored in a computer-readable storage medium and loaded and executed by a processor, or by related hardware controlled by the computer program.
To this end, an embodiment of the present application further provides a storage medium, in which a computer program is stored, where the computer program can be loaded by a processor to execute the steps in any one of the exception link processing methods provided in the embodiment of the present application. For example, the computer program may perform the steps of:
acquiring a search link to be detected;
searching content based on the search links to obtain a webpage structure corresponding to each search link;
analyzing the webpage structure to obtain the description information of the search link on at least one content dimension;
aiming at the description information of each content dimension, carrying out anomaly detection on the search link by adopting a corresponding anomaly link detection strategy to obtain an anomaly detection result of each content dimension;
and carrying out blocking processing on the search link based on the abnormal detection result.
For the specific implementation of the above operations, refer to the foregoing embodiments; details are not repeated here.
Since the computer program stored in the storage medium can execute the steps in any of the abnormal link processing methods provided in the embodiments of the present application, it can achieve the beneficial effects of any of those methods, which are detailed in the foregoing embodiments and are not repeated here.
The abnormal link processing method, apparatus, computer device, and storage medium provided in the embodiments of the present application have been described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help in understanding the method and core ideas of the present application. Meanwhile, for those skilled in the art, the specific implementations and the application scope may vary according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (15)

1. An exception link handling method, comprising:
acquiring a search link to be detected;
searching content based on the search link to obtain a webpage structure corresponding to the search link;
analyzing the webpage structure to obtain the description information of the search link on at least one content dimension;
aiming at the description information of each content dimension, carrying out anomaly detection on the search link by adopting a corresponding anomaly link detection strategy to obtain an anomaly detection result of each content dimension;
and carrying out blocking processing on the search link based on the abnormal detection result.
2. The method of claim 1, wherein the description information in the at least one content dimension comprises original web content; the method for detecting the abnormality of the search link by adopting a corresponding abnormal link detection strategy aiming at the description information of each content dimension to obtain an abnormal detection result of each content dimension comprises the following steps:
performing filtering content detection on the original webpage content;
when the original webpage content does not comprise preset filtering content, analyzing the original webpage content to obtain main content in the original webpage content;
and performing content detection on the main content to obtain the abnormal detection result.
3. The method according to claim 2, wherein the performing content detection on the subject content to obtain the anomaly detection result comprises:
performing text detection on the main content;
and when the main content is detected to comprise text content, carrying out abnormal keyword detection on the text content to obtain the abnormal detection result.
4. The method according to claim 3, wherein the performing abnormal keyword detection on the text content to obtain the abnormal detection result comprises:
performing word segmentation processing on the text content to obtain at least one text sub-word;
matching the text sub-words with preset abnormal keywords to obtain keyword matching results;
and generating the abnormal detection result based on the keyword matching result.
5. The method of claim 4, wherein the generating the anomaly detection result based on the keyword matching result comprises:
when the text sub-words are not matched with the preset abnormal keywords, performing semantic extraction on the text content to obtain semantic features of the text content;
respectively carrying out forward coding and backward coding on the semantic features to obtain forward coding information corresponding to the forward coding and backward coding information corresponding to the backward coding;
fusing the forward coding information and the backward coding information to obtain fused coding information;
and calculating the abnormal probability of the search link based on the fused coding information to obtain the abnormal detection result.
6. The method according to claim 4, wherein before matching the text sub-words with the preset abnormal keywords, the method comprises:
acquiring an initial abnormal keyword;
performing expansion processing on the initial abnormal keywords to obtain expanded abnormal keywords;
performing abnormal link search based on the expanded abnormal keywords to obtain abnormal link search results;
and screening the preset abnormal keywords from the expanded abnormal keywords based on the abnormal link search result.
7. The method of claim 3, further comprising:
when detecting that the main content does not comprise text content, carrying out image detection on the main content;
when detecting that the main content comprises image content, performing character recognition on the image content;
and when the character information of the image content is identified, carrying out abnormal character detection on the character information to obtain the abnormal detection result.
8. The method of claim 7, further comprising:
when detecting that the main content does not comprise image content, clustering the original webpage content to obtain a target cluster corresponding to the original webpage content;
calculating the similarity between the objects in the target cluster;
and judging whether the search link is an abnormal link or not based on the similarity.
9. The method of claim 2, wherein the descriptive information in the at least one content dimension includes link content; the method for detecting the abnormality of the search link by adopting a corresponding abnormal link detection strategy aiming at the description information of each content dimension to obtain an abnormal detection result of each content dimension comprises the following steps:
analyzing the link content to obtain domain names of the link content on different levels;
clustering the link content based on the domain names of the link content on different levels to obtain a target domain name cluster corresponding to the link content;
and carrying out similarity discrimination on the target domain name cluster, and generating the abnormal detection result based on the discrimination result.
10. The method according to claim 9, wherein the clustering the link content based on the domain names of the link content at different levels to obtain a target domain name cluster corresponding to the link content comprises:
calculating the distance between the domain name and a preset domain name in a plurality of preset domain name clusters;
and determining a target domain name cluster of the domain name cluster from the plurality of preset domain name clusters based on the distance.
11. The method of claim 2, wherein the descriptive information in the at least one content dimension includes a link status code; the method for detecting the abnormality of the search link by adopting a corresponding abnormal link detection strategy aiming at the description information of each content dimension to obtain an abnormal detection result of each content dimension comprises the following steps:
matching the link state code with a preset abnormal state code to obtain a state code matching result;
and mapping the matching result to a corresponding abnormal detection result.
12. An abnormal link processing device, comprising:
an acquisition unit, configured to acquire a search link to be detected;
a content search unit, configured to perform a content search based on the search link to obtain a webpage structure corresponding to the search link;
an analysis unit, configured to analyze the webpage structure to obtain description information of the search link in at least one content dimension;
an anomaly detection unit, configured to perform anomaly detection on the search link with the corresponding abnormal link detection strategy for the description information of each content dimension, to obtain an anomaly detection result for each content dimension;
and a blocking unit, configured to perform blocking processing on the search link based on the anomaly detection result.
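The way the units of claim 12 hand data to one another can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function `process_link`, the `DetectionResult` type, and the rule that a link is blocked when any dimension is abnormal are hypothetical, since the claim says only that blocking is based on the anomaly detection result.

```python
from dataclasses import dataclass
from typing import Any, Callable, Mapping

@dataclass
class DetectionResult:
    dimension: str
    abnormal: bool

def process_link(
    link: str,
    fetch: Callable[[str], Any],                       # content search unit
    parse: Callable[[Any], Mapping[str, Any]],         # analysis unit
    detectors: Mapping[str, Callable[[Any], bool]],    # one strategy per content dimension
) -> tuple[list[DetectionResult], bool]:
    """Run one search link through the units and decide whether to block it."""
    structure = fetch(link)          # webpage structure for the link
    description = parse(structure)   # description information per content dimension
    results = [
        DetectionResult(dim, detectors[dim](value))
        for dim, value in description.items()
        if dim in detectors          # dimensions without a strategy are skipped
    ]
    # Assumed combination rule: block if any dimension is flagged abnormal.
    blocked = any(result.abnormal for result in results)
    return results, blocked
```

Passing the fetch, parse and detector callables in as parameters keeps each unit replaceable, which mirrors the claim's separation into independent units.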
13. A computer device, comprising a memory and a processor, wherein the memory stores an application program, and the processor is configured to run the application program in the memory to perform the operations of the abnormal link processing method according to any one of claims 1 to 11.
14. A storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor to perform the steps of the abnormal link processing method according to any one of claims 1 to 11.
15. A computer program product comprising a computer program or instructions, wherein the computer program or instructions, when executed by a processor, implement the steps of the abnormal link processing method according to any one of claims 1 to 11.
CN202111242832.0A 2021-10-25 2021-10-25 Abnormal link processing method and device, computer equipment and storage medium Pending CN114329287A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111242832.0A CN114329287A (en) 2021-10-25 2021-10-25 Abnormal link processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111242832.0A CN114329287A (en) 2021-10-25 2021-10-25 Abnormal link processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114329287A true CN114329287A (en) 2022-04-12

Family

ID=81045476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111242832.0A Pending CN114329287A (en) 2021-10-25 2021-10-25 Abnormal link processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114329287A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115150354A (en) * 2022-06-29 2022-10-04 北京天融信网络安全技术有限公司 Method and device for generating domain name, storage medium and electronic equipment
CN115150354B (en) * 2022-06-29 2023-11-10 北京天融信网络安全技术有限公司 Method and device for generating domain name, storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
US10764353B2 (en) Automatic genre classification determination of web content to which the web content belongs together with a corresponding genre probability
US9489401B1 (en) Methods and systems for object recognition
US7917514B2 (en) Visual and multi-dimensional search
US8051080B2 (en) Contextual ranking of keywords using click data
CN109726274B (en) Question generation method, device and storage medium
JP2018518788A (en) Web page training method and apparatus, search intention identification method and apparatus
JP2020027649A (en) Method, apparatus, device and storage medium for generating entity relationship data
US10031924B2 (en) System and method for feature recognition and document searching based on feature recognition
JP2010501096A (en) Cooperative optimization of wrapper generation and template detection
EP4258610A1 (en) Malicious traffic identification method and related apparatus
CN111079043A (en) Key content positioning method
CN111832290A (en) Model training method and device for determining text relevancy, electronic equipment and readable storage medium
CN113660541A (en) News video abstract generation method and device
CN109165373B (en) Data processing method and device
CN112579781B (en) Text classification method, device, electronic equipment and medium
CN114329287A (en) Abnormal link processing method and device, computer equipment and storage medium
CN112948573B (en) Text label extraction method, device, equipment and computer storage medium
CN115801455A (en) Website fingerprint-based counterfeit website detection method and device
KR20240013640A (en) Method for detecting harmful url
CN115774797A (en) Video content retrieval method, device, equipment and computer readable storage medium
KR102127635B1 (en) Big data based web-accessibility improvement apparatus and method
CN108009233B (en) Image restoration method and device, computer equipment and storage medium
CN111581950A (en) Method for determining synonym and method for establishing synonym knowledge base
EP2821934A1 (en) System and method for optical character recognition and document searching based on optical character recognition
CN114548083B (en) Title generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40070383
Country of ref document: HK