CN110765393A - Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression - Google Patents

Publication number
CN110765393A
Authority
CN
China
Prior art keywords
url
vector
harmful
vectorization
recognized
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910873712.7A
Other languages
Chinese (zh)
Inventor
王嘉伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weimeng Chuangke Network Technology China Co Ltd
Original Assignee
Weimeng Chuangke Network Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weimeng Chuangke Network Technology China Co Ltd filed Critical Weimeng Chuangke Network Technology China Co Ltd
Priority to CN201910873712.7A priority Critical patent/CN110765393A/en
Publication of CN110765393A publication Critical patent/CN110765393A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic

Abstract

The embodiment of the invention provides a method and a device for identifying harmful URLs based on vectorization and logistic regression. The method comprises the following steps: extracting a plurality of Uniform Resource Locators (URLs) that have been determined to be harmful requests, together with the URLs to be identified; vectorizing each URL to obtain a corresponding multidimensional vector; combining the multidimensional vectors of the URLs of all harmful requests to obtain a vector matrix X; for any URL to be identified, forming a vector matrix X1 to be identified and a label vector Y1 to be identified; inputting the vector matrix X1 and the label vector Y1 into the trained logistic regression model to obtain a training result h(X1) for the URL to be identified; when the training result h(X1) is greater than a first threshold, judging the URL request to be identified to be a harmful request; and when the training result h(X1) is less than or equal to the first threshold, judging the URL request to be identified to be a harmless request.

Description

Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression
Technical Field
The invention relates to the field of computers, in particular to a method and a device for identifying harmful URLs based on vectorization and logistic regression.
Background
Websites provide data to users. However, some users, for a variety of reasons, use machines to simulate human web page access requests. Machine access is typically frequent and large in volume. In addition, those who use machines to simulate human web page access requests are usually lawbreakers with an illegal purpose, such as crawling the core data of a website or flooding its core interfaces with requests. An anti-scraping system (anti-grab station system) is generally adopted to block this abnormal access.
If a lawbreaker initiates requests from multiple IP addresses, this typically happens in one of two ways. First, the lawbreaker writes a site-scraping request script on their own computer and actively changes their IP address after a certain number of requests or a certain period of time. Second, the lawbreaker deploys the scraping script on cloud server products, where the dynamic allocation of cloud server IP addresses results in access from multiple IP addresses. Although the IP addresses differ in both cases, the requests come from the same script, so their URLs follow a certain pattern, and a machine learning method can be used to determine whether a request is harmful.
In the process of implementing the invention, the applicant finds that at least the following problems exist in the prior art:
For URLs that follow no apparent pattern, manual labeling is required, for example:
/2/statuses/show?wm=3333_2001&b=0&from=1085199020&c=iphone&networktype=wifi&v_p=60&skin=default&v_f=1&lang=zh_CN&ua=iPad4,1__weibo__8.5.1__ipad__os10.1.1&sflag=1&ft=0&i=88e4c4f&did=bb8d107ee05a3fc06be80c1098ad7159&checktoken=49e5c194bdeed7cc8bbeaab67504cff1&gsid=&aid=01AgbVmfmoJjQmRb-L-ai9ITx0e88OqSta3GqK-53w72033U8.&s=&moduleID=feed&uicode=10000002&id=4372302667956367&luicode=20000061&_status_id=4372302667956367&mid=4372302667956367&has_member=1&lfid=universallink&isGetLongText=1
After the URLs are manually labeled, the string &v_p=60 is found to be associated with harmful requests: among URLs containing this string, the proportion previously labeled as harmful requests exceeds 99.9%, so a URL containing it can simply be judged to be a harmful request.
However, this method relies on purely manual labeling, which is labor intensive. In addition, judging whether a URL is a harmful request by checking whether a specific character string is present in it relies on a single feature, which lawbreakers can easily discover and bypass.
Disclosure of Invention
The embodiment of the invention provides a method and a device for identifying a harmful URL based on vectorization and logistic regression.
To achieve the above objects, in one aspect, an embodiment of the present invention provides a method for identifying a harmful URL based on vectorization and logistic regression, including:
extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;
vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X;
vectorizing the label corresponding to each URL; combining the tags of all the harmful requests after URL vectorization to obtain a tag vector Y;
inputting the vector matrix X and the label vector Y into a logistic regression model for training, and obtaining a trained logistic regression model after the training is finished;
for any URL to be identified, adding the multidimensional vector of the URL to be identified into a vector matrix X as a row to form a vector matrix X1 to be identified; adding the label of the URL to be recognized after vectorization into a label vector Y to form a label vector Y1 to be recognized; inputting a vector matrix X1 to be recognized and a label vector Y1 to be recognized into a trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized; and
when the training result h (X1) of the URL to be recognized is greater than a preset first threshold value, judging that the URL request to be recognized is a harmful request; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.
In another aspect, an embodiment of the present invention provides an apparatus for identifying a harmful URL based on vectorization and logistic regression, including:
the extraction device comprises: the uniform resource locator URL extraction module is used for extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;
a vectorization unit: the system is used for vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X; the label vectorization module is used for vectorizing the label corresponding to each URL; combining the tags of all the harmful requests after URL vectorization to obtain a tag vector Y;
a training unit: the logistic regression model is used for inputting the vector matrix X and the vector Y into the logistic regression model for training, and after the training is finished, the trained logistic regression model is obtained; for any URL to be identified, adding the multidimensional vector of the URL to be identified into a vector matrix X as a row to form a vector matrix X1 to be identified; adding the label of the URL to be recognized after vectorization into a label vector Y to form a label vector Y1 to be recognized; inputting a vector matrix X1 to be recognized and a label vector Y1 to be recognized into a trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized; and
a judging unit: the method is used for judging that the URL request to be recognized is a harmful request when the training result h (X1) of the URL to be recognized is larger than a preset first threshold value; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.
The technical scheme has the following beneficial effects: the URL of each harmful request is split into parameter pieces, each piece is converted into a binary character string, and the binary character strings are vectorized to form a vector matrix, so that the information contained in the character string of each URL can be fully utilized.
Because logistic regression is a machine learning algorithm that learns automatically from data, training a logistic regression model on the URL vector matrix of multiple harmful requests allows the characteristics of harmful request URLs to be determined. The URL to be identified is vectorized and combined with the vectors of the harmful request URLs used in training to form a vector matrix, which is input into the trained logistic regression model. Based on what the model learned from the manually labeled harmful request URLs, the similarity between the URL to be identified and the harmful request URLs can be judged, so that the URL to be identified is classified automatically as a harmful request or a normal request without manual operation.
With this method and device, when the website server receives a request online, the received URL can be judged effectively in real time through logistic regression, and the judgment criterion is not easy for a requester to discover and bypass, so that the judgment efficiency and accuracy are improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a flow chart illustrating a method for identifying harmful URLs based on vectorization and logistic regression according to an embodiment of the present invention;
FIG. 2 is a block diagram of an apparatus for identifying harmful URLs based on vectorization and logistic regression according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Definitions of some abbreviations and key terms to which the present invention relates
Anti-scraping system (anti-grab station system): websites output data to users, some of whom, for various reasons, use machines to simulate human web page access requests. Such machine access is typically large in volume and frequent, and can adversely affect the health of the server. The anti-scraping system blocks this abnormal access: it analyzes the real-time access log, identifies the scraping IP addresses, and maintains a database of blocked IP addresses.
Uniform resource locator URL: an example of a URL is as follows:
abc.com/user?u=1&cm=44
Here abc.com is the domain name, /user is the interface, and u=1 and cm=44 are the parameters in the URL.
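The decomposition of the example URL into domain name, interface, and parameters can be illustrated with Python's standard library. This is only a sketch: the helper name `split_url` is hypothetical, and the URL is prefixed with "//" because `urlsplit` recognizes a host only when a scheme or "//" is present.

```python
from urllib.parse import urlsplit, parse_qsl

def split_url(url: str):
    """Split a URL into (domain, interface, parameter list)."""
    # The example URL has no scheme, so "//" is prepended to let
    # urlsplit treat "abc.com" as the network location.
    parts = urlsplit(url if "//" in url else "//" + url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    return parts.netloc, parts.path, params

domain, interface, params = split_url("abc.com/user?u=1&cm=44")
print(domain)     # abc.com
print(interface)  # /user
print(params)     # [('u', '1'), ('cm', '44')]
```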
Logistic regression: a machine learning classification algorithm that can be trained on a set of labeled data and then used to make classification predictions on new data.
As shown in fig. 1, a flowchart of a method for identifying a harmful URL based on vectorization and logistic regression according to an embodiment of the present invention is provided, where the method for identifying a harmful URL based on vectorization and logistic regression includes:
s101: extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;
s102: vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X;
vectorizing the label corresponding to each URL; combining the tags of all the harmful requests after URL vectorization to obtain a tag vector Y;
s103: inputting the vector matrix X and the label vector Y into a logistic regression model for training, and obtaining a trained logistic regression model after the training is finished;
for any URL to be identified, adding the multidimensional vector of the URL to be identified into a vector matrix X as a row to form a vector matrix X1 to be identified; adding the label of the URL to be recognized after vectorization into a label vector Y to form a label vector Y1 to be recognized; inputting a vector matrix X1 to be recognized and a label vector Y1 to be recognized into a trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized; and
s104: when the training result h (X1) of the URL to be recognized is greater than a preset first threshold value, judging that the URL request to be recognized is a harmful request; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.
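For reference, the decision in S104 can be written using the standard logistic regression hypothesis. The weight vector w, the bias b, and the symbol τ for the first threshold are notation assumed here; the description itself names only the output h and a preset first threshold.

```latex
h(x) = \sigma\left(w^{\top} x + b\right) = \frac{1}{1 + e^{-(w^{\top} x + b)}},
\qquad
\text{request}(x) =
\begin{cases}
\text{harmful},  & h(x) > \tau \\
\text{harmless}, & h(x) \le \tau
\end{cases}
```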
Preferably, vectorizing each URL to obtain a corresponding multidimensional vector respectively includes:
s1021: splitting the parameters of each URL in sequence, using the symbol "&" in the URL as the split point, to obtain h parameter pieces; converting each parameter piece into a binary character string, and taking the first k bits of each binary character string to form a k-dimensional vector;
and arranging the k-dimensional vectors of the parameter pieces in sequence to obtain the multidimensional vector of each URL, wherein the number of elements of the multidimensional vector of each URL is h × k.
Preferably, converting each parameter piece into a binary character string comprises:
s1021-1: traversing each parameter piece of the URL in sequence with a hash function Hash, calculating the hash value of each parameter piece, taking the absolute value of the hash value, and converting the absolute value into a binary character string.
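A minimal sketch of the conversion of one parameter piece into a k-dimensional bit vector follows. The description does not name the hash function; `blake2b` truncated to a signed 64-bit integer is an assumption made here so that results are reproducible (Python's built-in `hash()` for strings is randomized per process). The function name `piece_to_bits` and the right-padding of very short binary strings are also assumptions.

```python
import hashlib
from typing import List

def piece_to_bits(piece: str, k: int = 8) -> List[int]:
    """Hash a parameter piece, take |hash|, and keep the first k binary digits."""
    # Assumed hash: 8-byte blake2b digest read as a signed 64-bit integer.
    digest = hashlib.blake2b(piece.encode("utf-8"), digest_size=8).digest()
    value = abs(int.from_bytes(digest, "big", signed=True))
    binary = bin(value)[2:]              # drop the "0b" prefix
    binary = binary[:k].ljust(k, "0")    # assumption: pad if fewer than k digits
    return [int(ch) for ch in binary]

bits = piece_to_bits("ntype=wifi", k=8)
print(bits)
```

Identical parameter pieces always map to the same vector, which is what allows requests generated by the same script to look alike after vectorization.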
Preferably, the method further comprises the following steps:
s1022: adjusting the number h × k of elements in the multidimensional vector of each URL so that the multidimensional vectors of all URLs have the same number of elements.
Preferably, the label corresponding to the URL is one of: harmful request, harmless request, and to be identified;
vectorizing the label corresponding to each URL specifically includes: if the label corresponding to the URL is a harmless request, the vectorized label is set to 0; if the label corresponding to the URL is a harmful request, the vectorized label is set to 1; and if the label corresponding to the URL is to be identified, the vectorized label is set to an unknown number y.
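The label mapping can be sketched as below. The string keys and the use of `None` to stand in for the unknown number y are illustrative assumptions, not part of the description.

```python
from typing import List, Optional

# Assumed label names; the description only fixes the numeric values 1 and 0.
LABEL_TO_VALUE = {"harmful": 1, "harmless": 0}

def vectorize_label(label: str) -> Optional[int]:
    """Return 1 for harmful, 0 for harmless, None for a URL still to be identified."""
    if label == "to_be_identified":
        return None  # stands in for the unknown number y
    return LABEL_TO_VALUE[label]

labels = ["harmful", "harmful", "harmless", "to_be_identified"]
Y: List[Optional[int]] = [vectorize_label(name) for name in labels]
print(Y)  # [1, 1, 0, None]
```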
As shown in fig. 2, a schematic structural diagram of an apparatus for identifying a harmful URL based on vectorization and logistic regression according to an embodiment of the present invention provides an apparatus for identifying a harmful URL based on vectorization and logistic regression, including:
the extraction device 21: the uniform resource locator URL extraction module is used for extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;
the vectorization unit 22: the system is used for vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X; vectorizing the label corresponding to each URL; combining the tags of all the harmful requests after URL vectorization to obtain a tag vector Y;
the training unit 23: the logistic regression model is used for inputting the vector matrix X and the vector Y into the logistic regression model for training, and after the training is finished, the trained logistic regression model is obtained; for any URL to be identified, adding the multidimensional vector of the URL to be identified into a vector matrix X as a row to form a vector matrix X1 to be identified; adding the label of the URL to be recognized after vectorization into a label vector Y to form a label vector Y1 to be recognized; inputting a vector matrix X1 to be recognized and a label vector Y1 to be recognized into a trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized; and
the judgment unit 24: the method is used for judging that the URL request to be recognized is a harmful request when the training result h (X1) of the URL to be recognized is larger than a preset first threshold value; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.
Preferably, the vectorization unit 22 comprises:
parameter piece conversion subunit 221: used for splitting the parameters of each URL in sequence, using the symbol "&" in the URL as the split point, to obtain a plurality of parameter pieces; converting each parameter piece into a binary character string, and taking the first k bits of each binary character string to form a k-dimensional vector;
and arranging the k-dimensional vectors of the parameter pieces in sequence to obtain the multidimensional vector of each URL, wherein the number of elements of the multidimensional vector of each URL is h × k.
Preferably, the parameter piece conversion subunit 221 is specifically used for traversing each parameter piece of the URL in sequence with a hash function Hash, calculating the hash value of each parameter piece, taking the absolute value of the hash value, and converting the absolute value into a binary character string.
Preferably, the vectorization unit 22 further comprises:
normalization subunit 222: used for adjusting the number h × k of elements in the multidimensional vector of each URL so that the multidimensional vectors of all URLs have the same number of elements.
Preferably, the tag corresponding to the URL includes: harmful requests, harmless requests and to-be-identified;
the vectorization unit is specifically configured so that: if the label corresponding to the URL is a harmless request, the vectorized label is set to 0; if the label corresponding to the URL is a harmful request, the vectorized label is set to 1; and if the label corresponding to the URL is to be identified, the vectorized label is set to an unknown number y.
The technical scheme of the embodiment of the invention has the following beneficial effects: the URL of each harmful request is split into parameter pieces, each piece is converted into a binary character string, and the binary character strings are vectorized to form a vector matrix, so that the information contained in the character string of each URL can be fully utilized.
Because logistic regression is a machine learning algorithm that learns automatically from data, training a logistic regression model on the URL vector matrix of multiple harmful requests allows the characteristics of harmful request URLs to be determined. The URL to be identified is converted into a binary character string through a hash function and vectorized, then combined with the vectors of the harmful request URLs used in training to form a vector matrix, which is input into the trained logistic regression model. Based on what the model learned from the manually labeled harmful request URLs, the similarity between the URL to be identified and the harmful request URLs can be judged, so that the URL to be identified is classified automatically as a harmful request or a normal request without manual operation.
With this method and device, when the website server receives a request online, the received URL can be judged effectively in real time through logistic regression, and the judgment criterion is not easy for a requester to discover and bypass, so that the judgment efficiency and accuracy are improved.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
In the prior art, although the URL is a very long character string, whether a request is harmful is judged only by checking whether the long string contains a specific character string, so the full character string of the URL is not utilized.
The method and the device of the invention make full use of the information contained in the character string of the URL, can effectively judge whether a URL is harmful, and are not easy to discover and bypass. URLs of a batch of requests are extracted and labeled as normal requests or harmful requests, and the label corresponding to each URL is vectorized (1 represents a harmful request URL, and 0 represents a normal request URL). The following specific operations are carried out for the URLs of the harmful requests:
First, URL segmentation: each URL is split at the symbol "&" it contains, forming a set of parameter pieces L: {A1, A2, A3, A4, …, Ah}, for a total of h parameter pieces.
Second, defining the hash function Hash: Hash is a function that converts a character string into a number. After conversion by the hash function, identical character strings have the same hash value, and different character strings have different hash values.
Third, converting the parameter pieces of each URL: for each parameter piece set L, each parameter piece Ai is traversed with the hash function. First, the hash value Hi of each Ai is calculated; next, the absolute value of Hi is taken; then the result is converted into a binary character string; finally, the first k bits of the binary character string are taken to form a k-dimensional (1 × k) vector.
Fourth, splicing vectors: for each URL, each parameter piece Ai becomes a k-dimensional vector, and these vectors are spliced to form an (h × k)-dimensional vector X, where h is the number of parameter pieces in L.
Fifth, data normalization: a constant s is taken as the maximum number of parameter pieces per URL after normalization. If h > s, the binary character strings corresponding to the first s of the h parameter pieces are taken. If h < s, the (h × k)-dimensional vector X is padded at the front to reach s parameter pieces, that is, (s - h) all-zero k-dimensional vectors are added in front of the existing h k-dimensional vectors. The purpose of this step is to make the spliced vector of every URL have exactly s × k dimensions, so that the resulting data is regular. This completes the conversion of a URL into a vector X.
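The splicing and normalization steps can be sketched as follows, taking the per-piece k-dimensional vectors as given. The function name `normalize` and the small 2-bit vectors in the usage lines are hypothetical and chosen only to keep the example short.

```python
from typing import List

def normalize(piece_vectors: List[List[int]], s: int, k: int) -> List[int]:
    """Splice per-piece k-bit vectors into one vector of exactly s * k elements."""
    kept = piece_vectors[:s]                            # h > s: keep the first s pieces
    padding = [[0] * k for _ in range(s - len(kept))]   # h < s: (s - h) zero vectors
    flat: List[int] = []
    for vec in padding + kept:                          # zero vectors go in front
        flat.extend(vec)
    return flat

three_pieces = [[1, 0], [1, 1], [0, 1]]                 # h = 3, k = 2
padded = normalize(three_pieces, s=4, k=2)
truncated = normalize(three_pieces, s=2, k=2)
print(padded)     # [0, 0, 1, 0, 1, 1, 0, 1]
print(truncated)  # [1, 0, 1, 1]
```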
Sixth, the label corresponding to each harmful request URL is vectorized, and the vectorized labels of the harmful request URLs are combined to obtain a label vector.
In practice, a batch of harmful requests is labeled manually. Their label is "harmful request", represented by 1 (a normal request would be labeled "normal request", represented by 0), which forms the label vector Y = [1, 1, 1, 1, …] corresponding to the batch of harmful requests. A logistic regression model is then trained with the vectors X and the label vector Y corresponding to the URLs of this batch. The logistic regression model learns the correspondence between X and Y by gradient descent; once training is finished, the output of the model on the training data X is close to the labels Y, and the trained logistic regression model is obtained. Next, the URLs to be identified (that is, URLs whose label, harmful request or normal request, is unknown) are combined with the previous harmful request URLs to form a vector matrix X1, and the vectorized labels of the URLs to be identified are added to the label vector Y to form a label vector Y1 to be identified. The vector matrix X1 and the label vector Y1 are input into the logistic regression model previously trained on the harmful request URLs. Because the model has learned the characteristics of the data in the training matrix X, it returns a corresponding result (for example 0.98 or 0.01), from which it can be judged whether the URL to be identified is a harmful request or a normal request, so that the classification of URLs is completed automatically without manual operation.
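How gradient descent fits a logistic regression model to X and Y can be sketched with NumPy. This is an illustrative sketch, not the patent's implementation: the learning rate, step count, function names, and the 4-dimensional toy data (which uses both labels, 1 for harmful and 0 for normal) are all assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.5, steps=500):
    """Batch gradient descent on the mean cross-entropy loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(steps):
        p = sigmoid(X @ w + b)        # current outputs h(X)
        grad_w = X.T @ (p - y) / n    # gradient with respect to the weights
        grad_b = np.mean(p - y)       # gradient with respect to the bias
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical toy bit vectors (not from the description).
X = np.array([[1, 0, 1, 0],
              [1, 0, 1, 1],
              [0, 1, 0, 0],
              [0, 1, 0, 1]], dtype=float)
y = np.array([1, 1, 0, 0], dtype=float)

w, b = train_logistic_regression(X, y)
scores = sigmoid(X @ w + b)
print(np.round(scores, 2))
```

After training, the outputs on the training rows move toward their labels, which is the sense in which the result on X becomes close to Y.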
Therefore, when the website server receives the request on line, the received URL can be classified in real time through the logistic regression model, the judgment of the harmful request or the normal request is completed, and if the request is the harmful request, the access can be directly refused.
The technical solution of the present invention is detailed below by specific examples:
Now suppose that the following URLs are extracted from the access request log. The first four have been manually determined to be harmful requests (each has label 1), and the 5th and 6th are URLs to be identified, that is, it is unknown whether they are harmful or normal requests:
(1)abc.com/u?ntype=wifi&d=1001&u=gas
(2)abc.com/u?ntype=wifi&d=1001&u=gms
(3)abc.com/u?ntype=wifi&d=1001&u=gamk
(4)abc.com/u?ntype=wifi&d=1001&u=peas
(5)abc.com/u?ntype=3g&d=100299&u=monk&inter=true&iv=22ddac4f&mid=122
(6)abc.com/u?ntype=mobile&d=3282&u=onelifee&b=isc&mid=22399
For log entry (1), the parameter string of the URL is:
ntype=wifi&d=1001&u=gas
First, the URL is segmented: splitting at "&" gives ntype=wifi, d=1001 and u=gas.
Second, the three parameter pieces are hashed. After conversion by the hash function, they become:
the hash value corresponding to ntype=wifi is: 2893893923183264473,
the hash value corresponding to d=1001 is: 399101938414587701,
the hash value corresponding to u=gas is: 1773628803862046195
Then the absolute values of the three hash values are taken and converted into binary character strings, which are, in order:
0b10100000101001001011010011000011001101000000000000001011011001
0b10110001001111001010001100100110100010111110000111100110101
0b1100010011101001100100000000000000001100110010010010111110011
Now assume that the first k = 8 bits of each binary string are taken and that the normalization constant is s = 4, so that each URL finally forms a 32-dimensional vector.
Taking the first k bits of the three parameter pieces in order gives: 10100000, 10110001, 11000100
The vector formed by splicing these first k bits is: 101000001011000111000100
Next, the data needs to be normalized. The number h of parameter pieces of this URL is 3 and the constant s for the whole data set is 4, so one 8-dimensional zero vector 00000000 needs to be added in front of "101000001011000111000100" (zeros are padded to form a k × s vector). The normalized vector X is: 00000000101000001011000111000100
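The arithmetic for URL (1) can be checked directly from the three hash values quoted in this example, with k = 8 and s = 4. The helper name `first_k_bits` is hypothetical; the hash values themselves are taken as given, since the hash function that produced them is not named.

```python
K, S = 8, 4
hash_values = [2893893923183264473, 399101938414587701, 1773628803862046195]

def first_k_bits(value: int, k: int) -> str:
    """Binary digits of |value| without the '0b' prefix, cut to the first k."""
    return bin(abs(value))[2:][:k]

pieces = [first_k_bits(h, K) for h in hash_values]
spliced = "".join(pieces)
# h = 3 < s = 4, so one all-zero 8-bit block is placed in front.
normalized = "0" * (K * (S - len(pieces))) + spliced

print(pieces)      # ['10100000', '10110001', '11000100']
print(normalized)  # 00000000101000001011000111000100
```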
When all 6 URLs are vectorized in turn, each URL forms a 32-dimensional vector (k × s = 8 × 4 = 32):
(1)00000000101000001011000111000100
(2)00000000101000001110000111000100
(3)00000000101000001101100011000100
(4)00000000101000001111101111000100
(5)11110100010001000100010010100111
(6)00101001001001011101000100110001
The label vector Y for the first four URLs, all of which are harmful requests, is then [1, 1, 1, 1].
The first 4 URLs, which are marked as harmful requests, are used to train the logistic regression model; as their vectors show, these 4 URLs are quite similar to one another. At this point a batch of URL data can be labelled manually, with the label of each harmful request URL set to 1 and the label of each normal request URL set to 0. The harmful request URLs are extracted and, as in the preceding steps, the string of each is vectorized into a 32-dimensional vector; the multidimensional vectors of all harmful request URLs are combined into a vector matrix X, with the multidimensional vector of each URL as one row of X. A label vector Y corresponding to the harmful request URLs is also formed. The logistic regression model is then trained.
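As a minimal sketch, the six 32-bit strings listed above can be turned into matrix rows in plain Python, and the similarity of the harmful URLs can be made concrete with a Hamming distance. The helper names `to_row` and `hamming` are illustrative and do not come from the text.

```python
vectors = [
    "00000000101000001011000111000100",  # (1)
    "00000000101000001110000111000100",  # (2)
    "00000000101000001101100011000100",  # (3)
    "00000000101000001111101111000100",  # (4)
    "11110100010001000100010010100111",  # (5)
    "00101001001001011101000100110001",  # (6)
]

def to_row(bits):
    """Turn a bit string into a list of 0/1 integers (one matrix row)."""
    return [int(b) for b in bits]

def hamming(a, b):
    """Number of positions at which two rows differ."""
    return sum(x != y for x, y in zip(a, b))

rows = [to_row(v) for v in vectors]
X = rows[:4]      # one row per harmful request URL
Y = [1, 1, 1, 1]  # label 1 = harmful request

dist_1_2 = hamming(rows[0], rows[1])  # two harmful URLs
dist_1_5 = hamming(rows[0], rows[4])  # harmful URL (1) versus URL (5)
print(dist_1_2, dist_1_5)
```

URLs (1) and (2) differ in far fewer positions than URLs (1) and (5), which is the kind of regularity the logistic regression model is expected to pick up.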
The logistic regression model can be implemented with tensorflow.keras, which makes logistic regression training simple and convenient: the code is short, and a model can be trained in only a few lines. The training data of the 4 harmful requests is written into the vector matrix X and the corresponding labels into Y, and the model is then written in tensorflow.keras:
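import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# One sigmoid unit over the 32-dimensional URL vector is a logistic regression.
model = Sequential()
model.add(Dense(1, activation=tf.nn.sigmoid, input_shape=(32,)))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
model.fit(x=X, y=Y, batch_size=64, epochs=10, validation_data=(Xtest, Ytest))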
Once the logistic regression model has been written, it is trained; after training is finished, the training result for the harmful request URLs is returned by:
model.predict(X_topredict)
Then the multidimensional vectors of the 5th and 6th URLs are combined with those of the 1st to 4th URLs to form a vector matrix X1, and the vectorized labels of the URLs to be recognized are added to the label vector Y to form the label vector Y1 to be recognized. The vector matrix X1 and the label vector Y1 are passed to the logistic regression model through model.predict(X_topredict). Because of the nature of logistic regression, the model has learned the characteristics of the data in the previously trained vector matrix X, so model.predict(X_topredict) returns the training results h(X1) of the 5th and 6th URLs, by which the two URLs can be classified. When the training result h(X1) of a URL to be recognized is greater than a first threshold, the URL to be recognized is judged to be a harmful request; when h(X1) is less than or equal to the first threshold, it is judged to be a harmless request. In this way it can be determined whether the 5th and 6th URLs are harmful requests. Here X_topredict is the data to be predicted, which must also undergo URL vectorization and be written into a matrix (the multidimensional vector of each URL to be recognized is added as a row to the vector matrix X to form the vector matrix X1 to be recognized, and the label corresponding to the URL to be recognized is vectorized and added to the label vector Y to form the label vector Y1 to be recognized).
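The threshold decision on h(x) can be illustrated with a small pure-Python logistic regression. This is a sketch under stated assumptions, not the tensorflow.keras model from the text: purely for illustration, vectors (1) to (3) are used as harmful examples (label 1), vectors (5) and (6) are assumed to be labelled harmless (label 0), vector (4) is held out as the URL to be recognized, and the first threshold is assumed to be 0.5. The text itself assigns no labels to (5) and (6) and does not state the threshold value.

```python
import math

def to_row(bits):
    """Turn a bit string into a list of 0/1 integers."""
    return [int(b) for b in bits]

def h(w, b, x):
    """Logistic regression output: sigmoid of the weighted sum of x plus the bias."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def train(X, Y, lr=0.5, epochs=200):
    """Fit weights and bias by batch gradient descent on the mean log loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(X, Y):
            err = h(w, b, x) - y  # derivative of the log loss with respect to z
            grad_b += err
            for j, xj in enumerate(x):
                grad_w[j] += err * xj
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Illustrative labels only: (1)-(3) harmful, (5)-(6) assumed harmless.
X = [to_row(v) for v in (
    "00000000101000001011000111000100",
    "00000000101000001110000111000100",
    "00000000101000001101100011000100",
    "11110100010001000100010010100111",
    "00101001001001011101000100110001",
)]
Y = [1, 1, 1, 0, 0]
w, b = train(X, Y)

FIRST_THRESHOLD = 0.5  # assumed value of the preset first threshold
x_new = to_row("00000000101000001111101111000100")  # vector (4), held out
score = h(w, b, x_new)
is_harmful = score > FIRST_THRESHOLD
print(is_harmful)
```

Because vector (4) shares most of its set bits with vectors (1) to (3), its score ends up above the assumed threshold and it is classified as a harmful request.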
The parameter pieces of the harmful request URLs are split out and converted into binary strings, and the binary strings are vectorized to form a vector matrix, so that the information contained in the strings of each URL can be fully used.
Because logistic regression is a machine learning method that learns automatically, training a logistic regression model on the URL vector matrix of multiple harmful requests allows the characteristics of harmful request URLs to be determined. The URL to be recognized is converted into binary strings by a hash function and vectorized; its vector is combined with the vectors of the harmful request URLs on which the logistic regression model was trained to form a vector matrix, which is passed to the logistic regression model. Based on the training result obtained from the manually labelled harmful request URLs, the similarity between the URL to be recognized and the harmful request URLs is judged, and thus whether the URL to be recognized is a harmful request or a normal request, so that URLs are classified automatically without manual operation.
With the method and the device, when the website server receives requests online, each received URL can be judged effectively in real time through logistic regression, and the judgment is not easily discovered and bypassed by a requester, so that both the efficiency and the accuracy of the judgment are improved.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general purpose or special purpose computer. For example, such computer-readable media can include, but is not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium, and, thus, is included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., infrared, radio, and microwave. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A method for identifying harmful URLs based on vectorization and logistic regression, comprising:
extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;
vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X;
vectorizing the label corresponding to each URL; combining the tags of all the harmful requests after URL vectorization to obtain a tag vector Y;
inputting the vector matrix X and the label vector Y into a logistic regression model for training, and obtaining a trained logistic regression model after the training is finished;
for any URL to be identified, adding the multidimensional vector of the URL to be identified into a vector matrix X as a row to form a vector matrix X1 to be identified; adding the label of the URL to be recognized after vectorization into a label vector Y to form a label vector Y1 to be recognized; inputting a vector matrix X1 to be recognized and a label vector Y1 to be recognized into a trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized; and
when the training result h (X1) of the URL to be recognized is greater than a preset first threshold value, judging that the URL request to be recognized is a harmful request; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.
2. The method for identifying harmful URLs based on vectorization and logistic regression as claimed in claim 1, wherein vectorizing each URL to obtain a corresponding multidimensional vector respectively comprises:
sequentially partitioning the parameters of each URL by using the symbol "&" in the URL as a partitioning point to obtain h parameter pieces; converting each parameter piece into a binary character string, and taking the first k bits of each parameter piece to form a k-dimensional vector;
and sequentially arranging the k-dimensional vectors of each parameter piece to obtain the multi-dimensional vector of each URL, wherein the element number of the multi-dimensional vector of each URL is h x k.
3. The method of identifying harmful URLs based on vectorization and logistic regression according to claim 2, wherein converting each parameter piece into a binary string comprises:
and traversing each parameter piece of the URL in sequence through a Hash function Hash, calculating to obtain a Hash value of each parameter piece, taking an absolute value of the Hash value, and converting the absolute value into a binary string.
4. The method of identifying harmful URLs based on vectorization and logistic regression as recited in claim 2, further comprising:
and adjusting the number h x k of elements in the multidimensional vector of each URL so that the number of elements in the multidimensional vector of all URLs is the same.
5. The method of claim 1 for identifying harmful URLs based on vectorization and logistic regression, wherein the tags corresponding to the URLs comprise: harmful requests, harmless requests and to-be-identified;
vectorizing the tag corresponding to each URL, specifically including: and if the label corresponding to the URL is a harmless request, setting the label vectorization to be 0, if the label corresponding to the URL is a harmful request, setting the label vectorization to be 1, and if the label corresponding to the URL is to be identified, setting the label vectorization to be an unknown number y.
6. An apparatus for identifying harmful URLs based on vectorization and logistic regression, comprising:
an extraction unit: configured to extract a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;
a vectorization unit: configured to vectorize each URL to respectively obtain corresponding multidimensional vectors; to combine the multidimensional vectors of the URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X; to vectorize the label corresponding to each URL; and to combine the vectorized labels of the URLs of all harmful requests to obtain a label vector Y;
a training unit: configured to input the vector matrix X and the label vector Y into a logistic regression model for training, and to obtain a trained logistic regression model after the training is finished; for any URL to be identified, to add the multidimensional vector of the URL to be identified into the vector matrix X as a row to form a vector matrix X1 to be recognized; to add the vectorized label of the URL to be recognized into the label vector Y to form a label vector Y1 to be recognized; and to input the vector matrix X1 to be recognized and the label vector Y1 to be recognized into the trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized;
a judging unit: configured to judge that the URL request to be recognized is a harmful request when the training result h (X1) of the URL to be recognized is greater than a preset first threshold value, and to judge that the URL request to be recognized is a harmless request when the training result h (X1) of the URL to be recognized is less than or equal to the first threshold value.
7. The apparatus for identifying harmful URLs based on vectorization and logistic regression according to claim 6, wherein the vectorization unit includes:
a parameter piece conversion subunit: configured to sequentially partition the parameters of each URL, using the symbol "&" in the URL as the partitioning point, to obtain a plurality of parameter pieces; and to convert each parameter piece into a binary character string and take the first k bits of each parameter piece to form a k-dimensional vector;
and sequentially arranging the k-dimensional vectors of each parameter piece to obtain the multi-dimensional vector of each URL, wherein the element number of the multi-dimensional vector of each URL is h x k.
8. The apparatus for identifying harmful URLs based on vectorization and logistic regression as recited in claim 7,
the parameter piece conversion subunit is specifically configured to traverse each parameter piece of the URL in sequence with a Hash function Hash, calculate the Hash value of each parameter piece, take the absolute value of the Hash value, and convert the absolute value into a binary string.
9. The apparatus for identifying harmful URLs based on vectorization and logistic regression as recited in claim 7, wherein the vectorization unit further comprises:
an adjusting subunit: configured to adjust the number h x k of elements in the multidimensional vector of each URL so that the number of elements in the multidimensional vectors of all URLs is the same.
10. The apparatus for identifying harmful URLs based on vectorization and logistic regression as recited in claim 6, wherein the tags corresponding to the URLs comprise: harmful requests, harmless requests and to-be-identified;
the vectorization unit is specifically configured to: and if the label corresponding to the URL is a harmless request, setting the label vectorization to be 0, if the label corresponding to the URL is a harmful request, setting the label vectorization to be 1, and if the label corresponding to the URL is to be identified, setting the label vectorization to be an unknown number y.
CN201910873712.7A 2019-09-17 2019-09-17 Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression Pending CN110765393A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910873712.7A CN110765393A (en) 2019-09-17 2019-09-17 Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910873712.7A CN110765393A (en) 2019-09-17 2019-09-17 Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression

Publications (1)

Publication Number Publication Date
CN110765393A true CN110765393A (en) 2020-02-07

Family

ID=69329498

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910873712.7A Pending CN110765393A (en) 2019-09-17 2019-09-17 Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression

Country Status (1)

Country Link
CN (1) CN110765393A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20200207