CN117176483A - Abnormal URL identification method and device and related products - Google Patents


Info

Publication number
CN117176483A
Authority
CN
China
Prior art keywords
url data
abnormal
url
data set
information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311455059.5A
Other languages
Chinese (zh)
Inventor
龙磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Iresearch Technology Co ltd
Original Assignee
Beijing Iresearch Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Iresearch Technology Co ltd
Priority to CN202311455059.5A
Publication of CN117176483A
Legal status: Pending

Abstract

The application provides a method and a device for identifying abnormal URLs, and related products. The method comprises the following steps: acquiring a first URL data set from a DPI data stream; matching the first URL data set with a white list, wherein the white list is a set of normal URL data; if URL data in the first URL data set is successfully matched with the white list, removing the URL data successfully matched with the white list from the first URL data set to obtain a second URL data set; if the URL data in the first URL data set fails to match the white list, taking the first URL data set as the second URL data set; and identifying abnormal URL data in the second URL data set by using a trained keyword identification model. Because the method identifies abnormal URL data in the DPI data stream as the stream is generated, it is more timely than the prior art.

Description

Abnormal URL identification method and device and related products
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and apparatus for identifying an abnormal URL, and a related product.
Background
With the continuous development of technology, networks have reached every aspect of people's lives: people follow current events, acquire knowledge, and make purchases through networks. At the same time, there is also much false information on networks, which can deceive some users and cause loss of property.
The prior art provides a method of identifying false information on a network. The prior art collects the uniform resource locators (Uniform Resource Locator, URL) at which users encountered false information, and generates a blacklist from the collected uniform resource locators. When a user accesses a web page corresponding to a uniform resource locator in the blacklist, an early warning is sent to the user, prompting that false information exists in the website being accessed so that the user can screen the content carefully and avoid loss of property. However, the abnormal uniform resource locators obtained with a blacklist tend to be poorly timed, because a locator enters the blacklist only after it has been reported.
Therefore, how to improve the timeliness of the identified abnormal uniform resource locators is a technical problem that needs to be solved by those skilled in the art.
Disclosure of Invention
Based on the problems, the application provides a method and a device for identifying abnormal URLs and related products to solve the problem that abnormal uniform resource locators obtained in the prior art are poor in timeliness.
The application provides an abnormal URL identification method, which comprises the following steps:
acquiring a first URL data set in the DPI data stream;
matching the first URL dataset with a whitelist;
if the URL data in the first URL data set is successfully matched with the white list, removing the URL data successfully matched with the white list from the first URL data set to obtain a second URL data set, wherein the white list is a set of normal URL data;
if the URL data in the first URL data set fails to match the white list, taking the first URL data set as the second URL data set;
and identifying abnormal URL data in the second URL data set by using the trained keyword identification model.
In one possible implementation, the method further includes:
identifying text information in page information corresponding to the abnormal URL data by using a trained text identification model;
and if abnormal text information exists in the page information, sending early warning information to a user accessing the abnormal URL data.
In one possible implementation, the method further includes:
identifying image information in page information corresponding to the abnormal URL data by using a trained image identification model;
and if abnormal image information exists in the page information, sending early warning information to a user accessing the abnormal URL data.
In one possible implementation, the method further includes:
extracting features of the abnormal URL data to obtain a plurality of URL features;
and if any abnormal URL feature exists in the plurality of URL features, sending early warning information to a user accessing the abnormal URL data.
In one possible implementation, the text recognition model is trained by:
processing text information in a webpage corresponding to the historical abnormal URL data to obtain a plurality of abnormal text information;
vectorizing the abnormal text information to obtain an abnormal text information vector;
constructing a training sample set, wherein sample data in the training sample set is an abnormal text information vector with labels;
and training the basic model by using the training sample set to obtain a text recognition model.
The application also provides a device for identifying the abnormal URL, which comprises the following modules:
the acquisition module is used for acquiring a first URL data set in the DPI data stream;
the matching module is used for matching the first URL data set with a white list;
the removing module is used for removing the URL data successfully matched with the white list from the first URL data set to obtain a second URL data set if the URL data in the first URL data set is successfully matched with the white list, wherein the white list is a set of normal URL data;
the determining module is used for taking the first URL data set as a second URL data set if the URL data in the first URL data set fails to match the white list;
and the keyword recognition module is used for recognizing abnormal URL data in the second URL data set by using the trained keyword recognition model.
In one possible implementation, the apparatus further includes:
the text recognition module is used for recognizing text information in the page information corresponding to the abnormal URL data by using the trained text recognition model;
and the early warning module is used for sending early warning information to a user accessing the abnormal URL data if abnormal text information exists in the page information.
In one possible implementation, the apparatus further includes:
the image recognition module is used for recognizing the image information in the page information corresponding to the abnormal URL data by using the trained image recognition model;
and the early warning module is used for sending early warning information to a user accessing the abnormal URL data if abnormal image information exists in the page information.
The application also provides an electronic device comprising a processor and a memory:
the memory is used for storing a computer program and transmitting the computer program to the processor;
the processor is used for executing the steps of the identification method of the abnormal URL according to the instructions in the computer program.
The present application also provides a computer-readable storage medium storing a computer program which, when executed by an electronic device, implements the steps of the above-described abnormal URL identification method.
Compared with the prior art, the application has the following beneficial effects:
After the first URL data set is acquired, the method provided by the application determines the normal URL data in the first URL data set by matching with the white list, and obtains the second URL data set by removing that normal URL data from the first URL data set. Abnormal URL data in the second URL data set is then identified by using the trained keyword identification model. The first URL data set is acquired from the DPI data stream, and its URL data can include both URL data corresponding to web pages the user has visited and URL data received by the user. The method therefore analyzes different kinds of URL data, and obtains abnormal URL data by analyzing URL data taken from a DPI data stream that is generated in real time.
Drawings
In order to illustrate the embodiments of the application or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the application, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a method for identifying abnormal URLs according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an abnormal URL identification process according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a training process of a text recognition model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a preprocessing flow of text information in a web page corresponding to abnormal URL data according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a preprocessing flow of image information in a web page corresponding to abnormal URL data according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an apparatus for identifying abnormal URLs according to an embodiment of the present application.
Detailed Description
As described above, there is a large amount of false information on today's networks that can pose a threat to the user's property. In the related art, when a user accesses a webpage corresponding to false information, a certain reminder can be sent to the user to remind the user that the false information exists in the webpage being accessed.
It has been found that, in the prior art, reminding a user who accesses a web page containing false information depends on a blacklist, which is a set of abnormal URL data. For example, user A accesses web page A, where false information exists, believes the false information, and suffers a property loss. The URL data corresponding to web page A can then be obtained and added to the blacklist. Once the blacklist has been established, if user B accesses web page A, a prompt can be sent to user B to warn that false information exists in web page A, because the URL data corresponding to web page A is in the blacklist. If user B instead accesses web page B, which also contains false information, no prompt is sent because the URL data corresponding to web page B is not in the blacklist, and the property of user B may be infringed. The web pages corresponding to URL data in the blacklist do contain false information, but an entry is added to the blacklist only after some user has been deceived. When a user accesses a brand-new web page that is not recorded in the blacklist, it is unknown whether that web page contains false information. The abnormal URL data in the blacklist of the related art is therefore poorly timed. To improve the timeliness of the identified abnormal URL data, the application provides an abnormal URL identification method, an abnormal URL identification device and related products.
In order to make the present application better understood by those skilled in the art, the following description will clearly and completely describe the technical solutions in the embodiments of the present application with reference to the accompanying drawings, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments. Based on the embodiments of the present application, all other embodiments that a person of ordinary skill in the art could obtain without making any inventive effort are within the scope of the present application.
It can be understood that the method provided by the application can be applied to a processing device, where the processing device can obtain the first URL data set in the DPI data stream, for example, a terminal device or a server that can obtain the first URL data set in the DPI data stream. The method provided by the application can be independently executed through the terminal equipment or the server, can also be applied to a network scene of communication between the terminal equipment and the server, and is executed through the cooperation of the terminal equipment and the server. The terminal equipment can be a computer, a mobile phone and other equipment. The server can be understood as an application server or a Web server, and can be an independent server or a cluster server in actual deployment.
Fig. 1 is a flowchart of a method for identifying abnormal URLs, which is provided by the application, and includes the following steps:
S101: A first URL data set is acquired from a DPI data stream.
A deep packet inspection (Deep Packet Inspection, DPI) data stream is produced by applying deep packet inspection to the traffic that users generate while using the network. Each packet in the network has a header that contains basic information such as the sender, the receiver, and the transmission time of the packet. Deep packet inspection can obtain not only the information in the packet header but also other information carried in the packet, such as URL data. In addition to URL data, a data packet may also carry information such as user identification, domain name, access time, terminal information, network protocol, IP address information, and upstream and downstream traffic.
The processing device obtains a plurality of URL data in the DPI data stream, and constructs a first URL data set using the plurality of URL data. In one possible implementation, the processing device may obtain the DPI data flow from parallel data flows in a distributed environment.
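As a rough illustration of S101, the sketch below builds a first URL data set from DPI records. The record layout (a dictionary with a `url` key) and the function name `build_first_url_set` are assumptions made for this example, since the application does not specify a record format. De-duplicating the URLs is also an assumption.

```python
from typing import Iterable


def build_first_url_set(dpi_records: Iterable[dict]) -> list[str]:
    """Collect distinct, non-empty URLs from DPI records, keeping arrival order.

    Each record is assumed (hypothetically) to be a dict with a "url" key.
    """
    seen: set[str] = set()
    urls: list[str] = []
    for record in dpi_records:
        url = (record.get("url") or "").strip()
        if url and url not in seen:
            seen.add(url)
            urls.append(url)
    return urls
```

Records that carry no URL are skipped, so later steps only see URL data.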
S102: the first URL dataset is matched to the whitelist.
The white list is a set of normal URL data. A web page that operates normally needs to have corresponding record information, so URL data with record information is generally normal URL data. In one possible implementation, the white list may be constructed from URL data that has record information.
The processing device matches URL data in the first URL data set with the whitelist.
S103: and if the URL data in the first URL data set is successfully matched with the white list, removing the URL data successfully matched with the white list from the first URL data set to obtain a second URL data set.
If the processing device determines that the first URL data set and the white list have the same URL data in the process of matching the URL data in the first URL data set and the white list, it may be determined that the matching of the URL data in the first URL data set and the white list is successful. The URL data in the first URL data set that is successfully matched with the white list is normal URL data, and at this time, the processing device may remove the normal URL data from the first URL data set, and use the data set obtained by removing the normal URL data as the second URL data set.
In one possible implementation, ten pieces of URL data exist in the first URL data set, and the processing device finds that three pieces of URL data exist in the white list in the ten pieces of URL data, at this time, the processing device may remove the three pieces of URL data from the first URL data set, and seven pieces of URL data after removing the three pieces of URL data from the first URL data set may be used as the second URL data set.
S104: and if the URL data in the first URL data set is failed to match the white list, taking the first URL data set as a second URL data set.
If, in the process of matching the URL data in the first URL data set with the white list, the processing device determines that no piece of URL data in the first URL data set exists in the white list, the matching can be determined to have failed. In this case, the processing device may take the first URL data set as the second URL data set.
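Steps S102 to S104 can be sketched as a single filtering function. This is a minimal sketch that assumes exact string matching against the white list. The function name `build_second_url_set` is hypothetical.

```python
def build_second_url_set(first_url_set: list[str], whitelist: set[str]) -> list[str]:
    """Return the second URL data set according to S103 / S104."""
    matched = [url for url in first_url_set if url in whitelist]
    if not matched:
        # S104: matching failed, the first set is used as the second set.
        return list(first_url_set)
    # S103: remove every URL that was successfully matched with the white list.
    return [url for url in first_url_set if url not in whitelist]
```

With ten URLs in the first set and three of them on the white list, the function returns the remaining seven, as in the example given for S103.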
S105: and identifying abnormal URL data in the second URL data set by using the trained keyword identification model.
The abnormal URL data has certain characteristics, and the keyword recognition model is obtained by training aiming at the characteristics of the abnormal URL data.
In one possible implementation, the keywords may be similar words, and the keyword recognition model may recognize similar words in the URL data of the second URL data set. For example, one piece of normal URL data is http:///lll. The normal URL data includes a keyword formed by five letters l, whereas the abnormal URL data includes a keyword composed of four English letters l and one digit 1. The difference between "lllll" and "ll1ll" is not readily noticeable. For example, user A wants to make a purchase through the web page whose URL is http:///lll/a/B, but the web page actually opened is not http:///a/B but http:///ll 1 ll/a/B. If the web page corresponding to the abnormal URL data has been edited to imitate the web page corresponding to the normal URL data, user A is likely to be deceived by the information in the web page corresponding to http:///ll 1 ll/a/B, resulting in property loss. The keyword recognition model can be obtained by training with an XGBoost series model as the base model.
The processing device can recognize whether the second URL data set has the similar word through the keyword recognition model, the similar word is similar word relative to the normal URL data, and when the processing device recognizes that the URL data in the second URL data set has the similar word through the keyword recognition model, the URL data with the similar word is determined to be abnormal URL data.
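The similar-word idea can be illustrated with a simple rule-based check. Note that this is a stand-in heuristic, not the trained XGBoost-based model that the application describes. The confusable-character table and the function name `is_similar_word` are assumptions for this sketch.

```python
# Hypothetical table of digits that are easily mistaken for letters.
CONFUSABLE = str.maketrans({"1": "l", "0": "o", "5": "s"})


def is_similar_word(keyword: str, normal_keywords: set[str]) -> bool:
    """True if keyword imitates a normal keyword without being identical to it."""
    lowered = keyword.lower()
    normal = {k.lower() for k in normal_keywords}
    if lowered in normal:
        return False  # identical to a normal keyword, not an imitation
    folded_normal = {k.translate(CONFUSABLE) for k in normal}
    return lowered.translate(CONFUSABLE) in folded_normal
```

Under these assumptions, `is_similar_word("ll1ll", {"lllll"})` is true, while the normal keyword itself and unrelated keywords are not flagged.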
In another possible implementation, the keyword may be a preset word vector, and the keyword recognition model may recognize whether a preset word vector exists in the URL data of the second URL data set. In one implementation, the preset word vectors are obtained by analyzing historical abnormal URL data. For example, analysis of historical abnormal URL data may show that most of it contains keywords in which digits and letters alternate without any regular pattern, such as the keyword "uv5sw1e". In this case, the preset word vectors may include word vectors for such irregular digit-and-letter keywords. The keyword recognition model can be obtained by training with a FastText series model as the base model.
The processing device can recognize whether the preset word vector exists in the second URL data set through the keyword recognition model, and when the processing device recognizes that the preset word vector exists in the URL data in the second URL data set through the keyword recognition model, the URL data with the preset word vector is determined to be abnormal URL data.
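The pattern of digits and letters alternating irregularly can be approximated by counting how often a keyword switches between the two character classes. This is a hand-written heuristic offered only for illustration. It does not reproduce the FastText word vectors described above, and the threshold `min_switches=3` is an assumed value.

```python
def digit_letter_switches(token: str) -> int:
    """Count transitions between digit and letter characters in a token."""
    switches = 0
    previous = None
    for ch in token:
        if ch.isdigit():
            kind = "digit"
        elif ch.isalpha():
            kind = "letter"
        else:
            continue  # ignore punctuation and other symbols
        if previous is not None and kind != previous:
            switches += 1
        previous = kind
    return switches


def looks_irregular(token: str, min_switches: int = 3) -> bool:
    """Flag tokens whose digits and letters alternate at least min_switches times."""
    return digit_letter_switches(token) >= min_switches
```

The keyword "uv5sw1e" switches class four times and is flagged, whereas a keyword such as "shop2024" switches only once.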
In one possible implementation, there may be multiple keyword recognition models, such as a near word recognition model and a word vector recognition model, by which abnormal URL data in the second URL data set may be recognized.
In the method provided by the application, the URL data in the first URL data set can comprise the URL data corresponding to the webpage accessed by the user, and can also comprise the URL data received by the user. The method provided by the application analyzes different kinds of URL data, and obtains abnormal URL data by analyzing the URL data obtained from DPI data flow generated in real time. The method provided by the application can identify the abnormal URL data in the second URL data set through the keyword identification model, the keyword identification model can identify similar words, and can also identify preset word vectors, and the accuracy rate of identifying the abnormal URL data is increased by identifying different types of keywords.
After the processing device identifies the abnormal URL data, the identified abnormal URL data may be further processed. A web page typically contains text information and non-text information, and the non-text information is typically image information. Whether the abnormal URL data causes property loss to the user is judged mainly from the text information and non-text information in the page corresponding to the abnormal URL data. In one possible implementation, the processing device may identify text information in the page information corresponding to the abnormal URL data using a pre-trained text recognition model. The processing device performs preprocessing such as text cleaning, word segmentation, and text vectorization on the text information in the page corresponding to the abnormal URL data to obtain text vectors. The vectorization can use the TF-IDF algorithm, a word embedding algorithm, or both. Different text vectors may differ in dimension, in which case normalization can be applied so that all text vectors have the same dimension. The text vectors are input into the pre-trained text recognition model, which recognizes whether abnormal text information exists in them. If abnormal text information exists, the processing device sends early warning information to the user accessing the abnormal URL data, reminding the user that false information exists in the page being accessed.
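A minimal TF-IDF sketch in plain Python shows how segmented page text can be turned into vectors of equal dimension. The smoothed IDF formula and the L2 normalization used here are one common variant, chosen as an assumption. The application does not state which variant it uses, and the function names are hypothetical.

```python
import math
from collections import Counter


def fit_idf(docs: list[list[str]]) -> dict[str, float]:
    """Compute a smoothed inverse document frequency for every term."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1.0 for t, df in doc_freq.items()}


def tfidf_vector(doc: list[str], idf: dict[str, float], vocab: list[str]) -> list[float]:
    """Map one segmented text to an L2-normalized vector over a fixed vocabulary."""
    counts = Counter(doc)
    total = len(doc) or 1
    vec = [counts[t] / total * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec
```

Because every text is projected onto the same `vocab` list (for example `sorted(idf)`), all resulting vectors share one dimension regardless of text length.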
In one possible implementation, the text recognition model is trained as follows. The processing device processes the text information in the web pages corresponding to historical abnormal URL data to obtain a plurality of pieces of abnormal text information, and vectorizes them to obtain abnormal text information vectors, which correspond to a first type label. The processing device likewise processes the text information in the web pages corresponding to normal URL data to obtain a plurality of pieces of normal text information, and vectorizes them to obtain normal text information vectors, which correspond to a second type label. A training set is constructed from the abnormal and normal text information vectors, and the base model is trained on the training set until a training cutoff condition is met, which yields the text recognition model. The training cutoff condition may be that a preset number of training rounds has been completed, or that the model has reached a preset accuracy.
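The two cutoff conditions can be sketched with a deliberately simple model. The perceptron below is a substitute chosen to keep the example self-contained, not the base model of the application. Label 1 stands for the first type label (abnormal) and label 0 for the second type label (normal). The parameter names and default values are assumptions.

```python
def predict(weights: list[float], bias: float, x: list[float]) -> int:
    """Return 1 (abnormal) if the linear score is positive, else 0 (normal)."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0


def train_until_cutoff(samples, labels, max_rounds=100, target_accuracy=0.95, lr=0.1):
    """Train a perceptron until max_rounds is reached or target_accuracy is met."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    accuracy = 0.0
    rounds = 0
    for rounds in range(1, max_rounds + 1):
        for x, y in zip(samples, labels):
            error = y - predict(weights, bias, x)
            if error:
                weights = [w + lr * error * xi for w, xi in zip(weights, x)]
                bias += lr * error
        correct = sum(predict(weights, bias, x) == y for x, y in zip(samples, labels))
        accuracy = correct / len(samples)
        if accuracy >= target_accuracy:
            break  # cutoff: preset accuracy reached
    return weights, bias, rounds, accuracy
```

In practice the accuracy would be measured on held-out validation data rather than on the training set, which this sketch uses only for brevity.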
According to the method provided by the application, the keyword recognition model is directly utilized to recognize the abnormal URL data, the text recognition model is utilized to recognize the text information in the webpage corresponding to the abnormal URL data, and the text recognition model is utilized to reconfirm the abnormal URL data, so that false information is ensured to exist in the webpage corresponding to the abnormal URL data, and the accuracy of the determined abnormal URL data is ensured.
In one possible implementation, the processing device may identify the image information in the page information corresponding to the abnormal URL data using a pre-trained image recognition model. The processing device first applies one or more preprocessing operations, such as denoising, image enhancement, grayscale conversion, and binary image conversion, to the images in the page corresponding to the abnormal URL data. An optical character recognition (Optical Character Recognition, OCR) module of the image recognition model then locates the text regions in the image, a convolutional neural network (CNN) or convolutional recurrent neural network (CRNN) in the image recognition model extracts the text information, and an analysis module of the image recognition model determines whether the image contains abnormal image information. If the processing device determines through the image recognition model that abnormal image information exists in the page information corresponding to the abnormal URL data, it sends early warning information to the user accessing the abnormal URL data, reminding the user that false information exists in the page being accessed.
According to the method provided by the application, not only can the text information in the webpage corresponding to the abnormal URL data be identified through the text identification model, but also the image information in the webpage corresponding to the abnormal URL data can be identified through the image identification model, and the abnormal URL data can be confirmed again through the image identification model, so that false information is ensured to exist in the webpage corresponding to the abnormal URL data, and the accuracy of the determined abnormal URL data is ensured.
In one possible implementation manner, after the processing device identifies the abnormal URL data in the second URL data set, the abnormal URL data may be subjected to feature extraction to obtain a plurality of URL features, the plurality of URL features are analyzed, and if any one abnormal URL feature exists in the plurality of URL features, early warning information is sent to the user accessing the abnormal URL data. The URL characteristics may include the length of the URL, the length of the domain name portion in the URL, the number of domain names in the URL, the length of a specific string in the URL, the number of times of occurrence of "@" in the URL, the number of times of occurrence of "-" in the URL, the number of times of occurrence of "//" in the URL, the number of times of occurrence of preset symbols in the URL, whether the URL contains an IP address, the number of times of access of the URL, the registration time of the domain name in the URL, the number of times of redirection in the URL, the location of occurrence of the top-level domain name in the URL, and the like. The abnormal URL feature may be analyzed from historical abnormal URL data. In one possible implementation, the length of the URL corresponding to the historical abnormal URL data is mostly more than 200 characters, and then the URL feature more than 200 characters may be determined as the abnormal URL feature. In another possible implementation, if the number of redirections in the partial historical abnormal URL data is greater, then a URL feature with a number of redirections exceeding the preset number may be determined as an abnormal URL feature. In yet another possible implementation, the registration time of the domain name corresponding to the part of the historical abnormal URL data is often shorter, and then a URL feature with the registration time of the domain name being less than one year in the URL may be determined as the abnormal URL feature.
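Several of the lexical features listed above can be computed with the Python standard library alone. In this sketch the 200-character and one-year thresholds come from the text. The redirect limit `max_redirects=3`, the function names, and the idea of passing redirect count and domain age in as externally obtained values are assumptions.

```python
import ipaddress
from urllib.parse import urlsplit


def extract_url_features(url: str) -> dict:
    """Compute a few lexical URL features from the URL string itself."""
    host = urlsplit(url).hostname or ""
    try:
        ipaddress.ip_address(host)
        has_ip_host = True
    except ValueError:
        has_ip_host = False
    return {
        "url_length": len(url),
        "domain_length": len(host),
        "at_count": url.count("@"),
        "hyphen_count": url.count("-"),
        "double_slash_count": url.count("//"),
        "has_ip_host": has_ip_host,
    }


def abnormal_url_features(features, redirect_count=None, domain_age_days=None,
                          max_length=200, max_redirects=3):
    """Return the names of features judged abnormal; an empty list means none."""
    flagged = []
    if features["url_length"] > max_length:
        flagged.append("url_length")
    if redirect_count is not None and redirect_count > max_redirects:
        flagged.append("redirect_count")
    if domain_age_days is not None and domain_age_days < 365:
        flagged.append("domain_age")
    return flagged
```

A non-empty result corresponds to the case in which any abnormal URL feature exists, which triggers the early warning.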
According to the method provided by the application, the historical abnormal URL data is analyzed to obtain the abnormal URL characteristics possibly existing in the abnormal URL data, and the abnormal URL data is reconfirmed through the abnormal URL characteristics so as to ensure the accuracy of the determined abnormal URL data.
Fig. 2 is a schematic diagram of an abnormal URL identification flow provided in the present application. After the processing device obtains a DPI data stream, the URL data in the DPI data stream undergoes white list matching, similar word recognition model recognition, and word vector recognition model recognition in sequence. The white list is a set of normal URL data, the similar word library can be obtained by analyzing historical abnormal URL data, and the word vector dictionary library comprises a plurality of preset word vectors. After these three stages have been performed in order, abnormal URL data is obtained. The abnormal URL data is then recognized by the text recognition model and the image recognition model, and URL feature analysis is performed on it at the same time. Any one of the text recognition model, the image recognition model, and the URL feature analysis may confirm the abnormal URL data. After the abnormal URL data is confirmed, an early warning can be sent to a user accessing the web page corresponding to the abnormal URL data.
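The flow of Fig. 2 can be sketched as a small orchestration function in which each model is represented by a callable that takes a URL and returns a boolean. The function name `identify_abnormal_urls` and the callable interface are assumptions for illustration. Real models would need page content as well as the URL.

```python
from typing import Callable, Iterable

Check = Callable[[str], bool]


def identify_abnormal_urls(urls: Iterable[str], whitelist: set[str],
                           keyword_models: list[Check],
                           confirmers: list[Check]) -> list[str]:
    """White list filter, then keyword models, then any-of confirmation."""
    # Stage 1: drop URLs that are on the white list.
    candidates = [u for u in urls if u not in whitelist]
    # Stage 2: keep URLs flagged by any keyword model (similar word, word vector).
    suspected = [u for u in candidates if any(model(u) for model in keyword_models)]
    # Stage 3: confirm if any of text, image, or URL feature analysis agrees.
    return [u for u in suspected if any(confirm(u) for confirm in confirmers)]
```

The returned URLs are the ones for which an early warning would be sent.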
Fig. 3 is a schematic diagram of a training process of a text recognition model according to the present application. An appropriate base model is selected first; in a possible implementation, a LightGBM series model may be selected as the base model. The historical abnormal URL data, the normal URL data and the URL data existing in the database are integrated to form a training set, and the URL data in the training set is labeled. The data in the training set is preprocessed, which may include one or more of data cleaning, data integration, and data conversion. An outlier is a value in the data that differs significantly from the other values, and the processing device may remove from the training set the data whose outlier values exceed a preset value. Index system construction refers to determining the positive and negative samples used in training. Finally, features are extracted from the data in the training set, and the extracted features are used for model training and verification.
Fig. 4 is a schematic diagram of a preprocessing flow of text information in a web page corresponding to abnormal URL data. The text to be preprocessed is the text in the web page corresponding to the abnormal URL data. The text is cleaned first and then segmented into words, which yields a plurality of word segmentation results. Each word segmentation result is matched against the stop words in a stop word list: if the result exists in the stop word list it is deleted, and otherwise it is kept. The processing device then judges whether any unprocessed word segmentation result remains. If so, it continues with the next unprocessed result until all word segmentation results have been processed, and the flow ends.
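The cleaning, segmentation, and stop word steps of Fig. 4 can be sketched as follows. Stripping HTML tags with a regular expression and splitting on word characters are simplifying assumptions. Chinese page text would need a dedicated word segmenter, which is outside this sketch.

```python
import re


def preprocess_text(raw_text: str, stop_words: set[str]) -> list[str]:
    """Clean page text, segment it into words, and drop stop words."""
    # Cleaning: replace markup tags with spaces and lower-case the text.
    cleaned = re.sub(r"<[^>]+>", " ", raw_text).lower()
    # Segmentation: a naive split into runs of word characters.
    tokens = re.findall(r"\w+", cleaned)
    # Stop word filtering: keep only tokens that are not in the stop word list.
    return [token for token in tokens if token not in stop_words]
```

The list comprehension plays the role of the loop in the figure, which visits every word segmentation result once.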
Fig. 5 is a schematic diagram of the preprocessing flow for image information in the web page corresponding to abnormal URL data. An image from the web page corresponding to the abnormal URL data is input first, and one or more of denoising, image enhancement, gray-scale conversion and binary conversion are performed on it. Text region detection is then performed on the preprocessed image, followed by text recognition and extraction of text information features. Finally, the text information of the image is output.
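Two of the preprocessing operations named above, gray-scale conversion and binary conversion, can be sketched on a plain nested-list image as follows. The image representation, the ITU-R BT.601 luminance weights and the fixed threshold of 128 are assumptions made for illustration; denoising, enhancement, text region detection and text recognition are not shown.

```python
from typing import List, Tuple

RGBImage = List[List[Tuple[int, int, int]]]
GrayImage = List[List[int]]


def to_gray(rgb_image: RGBImage) -> GrayImage:
    """Convert rows of (r, g, b) pixels to 0-255 luminance values
    using the ITU-R BT.601 weights."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in row]
        for row in rgb_image
    ]


def to_binary(gray_image: GrayImage, threshold: int = 128) -> GrayImage:
    """Map each gray value to 1 (at or above the threshold) or 0 (below it)."""
    return [[1 if v >= threshold else 0 for v in row] for row in gray_image]
```

A binary image of this kind is a common input to text region detection, because text strokes and background are separated into two values.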
The application also provides an abnormal URL identification device, a schematic structural diagram of which is shown in Fig. 6. The abnormal URL identification device 600 includes the following modules:
an obtaining module 601, configured to obtain a first URL data set in a DPI data stream;
a matching module 602, configured to match the first URL data set with a whitelist;
a removing module 603, configured to, if URL data in the first URL data set is successfully matched with the whitelist, remove the successfully matched URL data from the first URL data set to obtain a second URL data set, where the whitelist is a set of normal URL data;
a determining module 604, configured to take the first URL data set as a second URL data set if the URL data in the first URL data set fails to match the whitelist;
a keyword recognition module 605, configured to recognize abnormal URL data in the second URL data set using a trained keyword recognition model.
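The removing module and the determining module can be read as two outcomes of a single filtering pass, as in the sketch below. Matching on the host name and using a plain keyword set in place of the trained keyword recognition model are assumptions made for illustration; the function names are hypothetical.

```python
from typing import Iterable, List, Set
from urllib.parse import urlsplit


def build_second_set(first_set: Iterable[str], whitelist_hosts: Set[str]) -> List[str]:
    """Remove whitelisted URLs; if nothing matches, the result equals the input."""
    second = []
    for url in first_set:
        host = urlsplit(url).hostname or ""
        if host in whitelist_hosts:
            continue  # successfully matched with the whitelist: removed
        second.append(url)
    return second


def keyword_stage(second_set: Iterable[str], keywords: Set[str]) -> List[str]:
    """Stand-in for the trained keyword recognition model: flag any URL
    containing one of the given keywords."""
    return [u for u in second_set if any(k in u.lower() for k in keywords)]
```

When no URL matches the whitelist, `build_second_set` returns the first URL data set unchanged, which is the case handled by the determining module 604.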
In one possible implementation, the apparatus further includes a text recognition module:
the text recognition module is used for recognizing text information in page information corresponding to the abnormal URL data by using a trained text recognition model;
and if abnormal text information exists in the page information, sending early warning information to a user accessing the abnormal URL data.
In one possible implementation, the apparatus further includes an image recognition module:
the image recognition module is used for recognizing image information in the page information corresponding to the abnormal URL data by using a trained image recognition model;
and if abnormal image information exists in the page information, sending early warning information to a user accessing the abnormal URL data.
In one possible implementation, the apparatus further includes a URL feature recognition module:
the URL feature recognition module is used for carrying out feature extraction on the abnormal URL data to obtain a plurality of URL features;
and if any abnormal URL feature exists in the plurality of URL features, sending early warning information to a user accessing the abnormal URL data.
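The URL feature recognition module could be sketched as follows. The description does not list the URL features, so the features chosen here (overall length, IP-literal host, subdomain depth, an "@" in the authority part, digit ratio) and the thresholds are hypothetical examples only.

```python
import ipaddress
from typing import Dict, Union
from urllib.parse import urlsplit

Features = Dict[str, Union[int, float, bool]]


def extract_url_features(url: str) -> Features:
    """Extract a few illustrative features from a URL string."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False
    return {
        "length": len(url),
        "host_is_ip": host_is_ip,
        # Number of labels in front of the registered name, roughly estimated.
        "subdomain_depth": 0 if host_is_ip else max(host.count(".") - 1, 0),
        "has_at_sign": "@" in parts.netloc,
        "digit_ratio": sum(c.isdigit() for c in host) / len(host) if host else 0.0,
    }


def has_abnormal_feature(features: Features, max_length: int = 200, max_depth: int = 3) -> bool:
    """Return True if any single feature is abnormal (thresholds are assumed)."""
    return bool(
        features["host_is_ip"]
        or features["has_at_sign"]
        or features["length"] > max_length
        or features["subdomain_depth"] > max_depth
    )
```

Because one abnormal feature is enough, the check is a logical OR over the individual feature tests, after which the early warning would be sent.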
In a possible implementation manner, the device further comprises a model training module, wherein the model training module is used for processing text information in a webpage corresponding to the historical abnormal URL data to obtain a plurality of abnormal text information;
vectorizing the abnormal text information to obtain an abnormal text information vector;
constructing a training sample set, wherein sample data in the training sample set is an abnormal text information vector with labels;
and training the basic model by using the training sample set to obtain a text recognition model.
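The vectorization and sample construction steps performed by the model training module can be illustrated with a simple bag-of-words encoding. The encoding, the function names and the integer labels are assumptions made for illustration; the description does not specify the vectorization method, and the training of the basic model itself is not shown.

```python
from collections import Counter
from typing import Dict, Iterable, List, Sequence, Tuple


def build_vocabulary(token_lists: Iterable[Sequence[str]]) -> Dict[str, int]:
    """Assign each distinct token an index in order of first appearance."""
    vocab: Dict[str, int] = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def vectorize(tokens: Sequence[str], vocab: Dict[str, int]) -> List[int]:
    """Turn a token list into a count vector over the vocabulary."""
    vector = [0] * len(vocab)
    for token, count in Counter(tokens).items():
        index = vocab.get(token)
        if index is not None:  # tokens outside the vocabulary are ignored
            vector[index] = count
    return vector


def build_training_set(
    labelled_texts: List[Tuple[Sequence[str], int]],
) -> Tuple[List[Tuple[List[int], int]], Dict[str, int]]:
    """Return (vector, label) samples together with the vocabulary used."""
    vocab = build_vocabulary(tokens for tokens, _ in labelled_texts)
    samples = [(vectorize(tokens, vocab), label) for tokens, label in labelled_texts]
    return samples, vocab
```

The resulting labelled vectors are the kind of sample data that would then be passed to the basic model for training.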
The embodiment of the application also provides an abnormal URL identification device, wherein the device comprises a memory and a processor, the memory is used for storing instructions or codes, and the processor is used for executing the instructions or codes so that the device executes the steps of the abnormal URL identification method according to any embodiment of the application.
In practical applications, the computer-readable storage medium may take the form of any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium.
The computer readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object oriented programming languages such as Java, Smalltalk and C++, and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
It should be noted that the embodiments in this specification are described in a progressive manner: identical and similar parts of the embodiments may be referred to across embodiments, and each embodiment focuses on its differences from the other embodiments. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and the relevant points can be found in the description of the method embodiments. The device embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present application without creative effort.
The foregoing is only one specific embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the technical scope of the present application should be included in the scope of the present application. Therefore, the protection scope of the present application should be subject to the protection scope of the claims.

Claims (10)

1. An abnormal URL identification method, comprising:
acquiring a first URL data set in the DPI data stream;
matching the first URL data set with a white list;
if URL data in the first URL data set is successfully matched with the white list, removing the successfully matched URL data from the first URL data set to obtain a second URL data set, wherein the white list is a set of normal URL data;
if the URL data in the first URL data set fails to match the white list, taking the first URL data set as the second URL data set;
and identifying abnormal URL data in the second URL data set by using the trained keyword identification model.
2. The method according to claim 1, wherein the method further comprises:
identifying text information in page information corresponding to the abnormal URL data by using a trained text identification model;
and if abnormal text information exists in the page information, sending early warning information to a user accessing the abnormal URL data.
3. The method according to claim 1, wherein the method further comprises:
identifying image information in page information corresponding to the abnormal URL data by using a trained image identification model;
and if abnormal image information exists in the page information, sending early warning information to a user accessing the abnormal URL data.
4. The method according to claim 1, wherein the method further comprises:
extracting features of the abnormal URL data to obtain a plurality of URL features;
and if any abnormal URL feature exists in the plurality of URL features, sending early warning information to a user accessing the abnormal URL data.
5. The method of claim 2, wherein the text recognition model is trained by:
processing text information in a webpage corresponding to the historical abnormal URL data to obtain a plurality of abnormal text information;
vectorizing the abnormal text information to obtain an abnormal text information vector;
constructing a training sample set, wherein sample data in the training sample set is an abnormal text information vector with labels;
and training the basic model by using the training sample set to obtain a text recognition model.
6. An apparatus for identifying an abnormal URL, comprising:
the acquisition module is used for acquiring a first URL data set in the DPI data stream;
the matching module is used for matching the first URL data set with a white list;
the removing module is used for removing the URL data successfully matched with the white list from the first URL data set to obtain a second URL data set if the URL data in the first URL data set is successfully matched with the white list, wherein the white list is a set of normal URL data;
the determining module is used for taking the first URL data set as a second URL data set if the URL data in the first URL data set fails to match the white list;
and the keyword recognition module is used for recognizing abnormal URL data in the second URL data set by using the trained keyword recognition model.
7. The apparatus of claim 6, wherein the apparatus further comprises:
the text recognition module is used for recognizing text information in the page information corresponding to the abnormal URL data by using the trained text recognition model;
and the early warning module is used for sending early warning information to a user accessing the abnormal URL data if abnormal text information exists in the page information.
8. The apparatus of claim 6, wherein the apparatus further comprises:
the image recognition module is used for recognizing the image information in the page information corresponding to the abnormal URL data by using the trained image recognition model;
and the early warning module is used for sending early warning information to a user accessing the abnormal URL data if abnormal image information exists in the page information.
9. An electronic device comprising a memory and a processor, wherein:
the memory is used for storing a computer program;
the processor is configured to execute the computer program to implement the method for identifying abnormal URLs according to any one of claims 1-5.
10. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the method of identifying an abnormal URL according to any one of claims 1-5.
CN202311455059.5A 2023-11-03 2023-11-03 Abnormal URL identification method and device and related products Pending CN117176483A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311455059.5A CN117176483A (en) 2023-11-03 2023-11-03 Abnormal URL identification method and device and related products

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311455059.5A CN117176483A (en) 2023-11-03 2023-11-03 Abnormal URL identification method and device and related products

Publications (1)

Publication Number Publication Date
CN117176483A true CN117176483A (en) 2023-12-05

Family

ID=88945450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311455059.5A Pending CN117176483A (en) 2023-11-03 2023-11-03 Abnormal URL identification method and device and related products

Country Status (1)

Country Link
CN (1) CN117176483A (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101035128A (en) * 2007-04-18 2007-09-12 大连理工大学 Three-folded webpage text content recognition and filtering method based on the Chinese punctuation
US20120159620A1 (en) * 2010-12-21 2012-06-21 Microsoft Corporation Scareware Detection
US8850567B1 (en) * 2008-02-04 2014-09-30 Trend Micro, Inc. Unauthorized URL requests detection
US20150156210A1 (en) * 2013-12-04 2015-06-04 Apple Inc. Preventing url confusion attacks
CN106055574A (en) * 2016-05-19 2016-10-26 微梦创科网络科技(中国)有限公司 Method and device for recognizing illegal URL
CN106131071A (en) * 2016-08-26 2016-11-16 北京奇虎科技有限公司 A kind of Web method for detecting abnormality and device
CN106357618A (en) * 2016-08-26 2017-01-25 北京奇虎科技有限公司 Web abnormality detection method and device
CN107332848A (en) * 2017-07-05 2017-11-07 重庆邮电大学 A kind of exception of network traffic real-time monitoring system based on big data
CN111753171A (en) * 2020-06-09 2020-10-09 北京天空卫士网络安全技术有限公司 Malicious website identification method and device
CN112532624A (en) * 2020-11-27 2021-03-19 深信服科技股份有限公司 Black chain detection method and device, electronic equipment and readable storage medium
CN112948725A (en) * 2021-03-02 2021-06-11 北京六方云信息技术有限公司 Phishing website URL detection method and system based on machine learning
CN113596007A (en) * 2021-07-22 2021-11-02 广东电网有限责任公司 Vulnerability attack detection method and device based on deep learning
CN113836459A (en) * 2021-08-10 2021-12-24 深圳市高腾科技服务有限公司 Web site page monitoring method, device, equipment and storage medium
CN115168755A (en) * 2022-07-26 2022-10-11 北京永信至诚科技股份有限公司 Abnormal data processing method and system based on URL (Uniform resource locator) characteristics
CN115878932A (en) * 2022-12-09 2023-03-31 杭州安恒信息技术股份有限公司 Website security event processing method, device, equipment and medium
CN116647377A (en) * 2023-05-26 2023-08-25 太平金融科技服务(上海)有限公司深圳分公司 Website inspection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination