CN110765393A - Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression

Info

Publication number: CN110765393A
Application number: CN201910873712.7A
Authority: CN
Inventors: 王嘉伟
Original assignee: Weimeng Chuangke Network Technology China Co Ltd
Current assignee: Weimeng Chuangke Network Technology China Co Ltd
Priority date: 2019-09-17
Filing date: 2019-09-17
Publication date: 2020-02-07

Abstract

An embodiment of the invention provides a method and a device for identifying harmful URLs based on vectorization and logistic regression. The method comprises: extracting a plurality of uniform resource locators (URLs) that have been determined to be harmful requests, together with URLs to be identified; vectorizing each URL to obtain a corresponding multidimensional vector; combining the multidimensional vectors of all harmful-request URLs to obtain a vector matrix X; for any URL to be identified, forming a vector matrix X1 to be identified and a label vector Y1 to be identified; inputting X1 and Y1 into the trained logistic regression model to obtain a training result h(X1) for the URL to be identified; when h(X1) is greater than a first threshold, judging the URL request to be identified to be a harmful request; and when h(X1) is less than or equal to the first threshold, judging it to be a harmless request.

Description

Method and device for identifying harmful URL (uniform resource locator) based on vectorization and logistic regression

Technical Field

The invention relates to the field of computers, in particular to a method and a device for identifying harmful URLs based on vectorization and logistic regression.

Background

Websites exist to serve data to users. Some users, however, for a variety of reasons, use machines to simulate human web page requests. Machine access is typically frequent and high in volume. Moreover, those who simulate human requests by machine are usually malicious actors with illegitimate aims, such as crawling a website's core data or flooding its core interfaces with requests. An anti-scraping system is generally used to block this abnormal access.

When a malicious actor issues requests from multiple IP addresses, it usually happens in one of two ways. First, the actor writes a request-flooding script on their own computer and actively switches IP address after a certain number of requests or a certain period of time. Second, the actor deploys the script on cloud server products, where the dynamic allocation of cloud server IPs results in access from multiple IPs. Although the IPs differ, the requests come from the same script, so their URLs follow a certain pattern, and a machine learning method can be used to determine whether a request is harmful.

In the process of implementing the invention, the applicant finds that at least the following problems exist in the prior art:

However, URLs that follow no obvious pattern must be labeled manually, for example:

/2/statuses/show?wm=3333_2001&b=0&from=1085199020&c=iphone&networktype=wifi&v_p=60&skin=default&v_f=1&lang=zh_CN&ua=iPad4,1__weibo__8.5.1__ipad__os10.1.1&sflag=1&ft=0&i=88e4c4f&did=bb8d107ee05a3fc06be80c1098ad7159&checktoken=49e5c194bdeed7cc8bbeaab67504cff1&gsid=&aid=01AgbVmfmoJjQmRb-L-ai9ITx0e88OqSta3GqK-53w72033U8.&s=&moduleID=feed&uicode=10000002&id=4372302667956367&luicode=20000061&_status_id=4372302667956367&mid=4372302667956367&has_member=1&lfid=universallink&isGetLongText=1

After manual labeling, the string &v_p=60 is found to be characteristic of harmful requests: more than 99.9% of the URLs containing it were previously labeled as harmful, so a URL containing that string can simply be judged a harmful request.

However, this approach relies on purely manual labeling, which is labor-intensive. In addition, judging a URL harmful by checking whether it contains one specific string depends on a single, unique marker, which malicious actors can easily discover and bypass.

Disclosure of Invention

The embodiment of the invention provides a method and a device for identifying a harmful URL based on vectorization and logistic regression.

To achieve the above objects, in one aspect, an embodiment of the present invention provides a method for identifying a harmful URL based on vectorization and logistic regression, including:

extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;

vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X;

vectorizing the label corresponding to each URL; combining the tags of all the harmful requests after URL vectorization to obtain a tag vector Y;

inputting the vector matrix X and the label vector Y into a logistic regression model for training, and obtaining a trained logistic regression model after the training is finished;

for any URL to be identified, adding the multidimensional vector of the URL to be identified into a vector matrix X as a row to form a vector matrix X1 to be identified; adding the label of the URL to be recognized after vectorization into a label vector Y to form a label vector Y1 to be recognized; inputting a vector matrix X1 to be recognized and a label vector Y1 to be recognized into a trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized; and

when the training result h (X1) of the URL to be recognized is greater than a preset first threshold value, judging that the URL request to be recognized is a harmful request; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.

In another aspect, an embodiment of the present invention provides an apparatus for identifying a harmful URL based on vectorization and logistic regression, including:

the extraction device comprises: the uniform resource locator URL extraction module is used for extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;

a vectorization unit: the system is used for vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X; the label vectorization module is used for vectorizing the label corresponding to each URL; combining the tags of all the harmful requests after URL vectorization to obtain a tag vector Y;

a training unit: the logistic regression model is used for inputting the vector matrix X and the vector Y into the logistic regression model for training, and after the training is finished, the trained logistic regression model is obtained; for any URL to be identified, adding the multidimensional vector of the URL to be identified into a vector matrix X as a row to form a vector matrix X1 to be identified; adding the label of the URL to be recognized after vectorization into a label vector Y to form a label vector Y1 to be recognized; inputting a vector matrix X1 to be recognized and a label vector Y1 to be recognized into a trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized; and

a judging unit: the method is used for judging that the URL request to be recognized is a harmful request when the training result h (X1) of the URL to be recognized is larger than a preset first threshold value; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.

The technical scheme has the following beneficial effects: the parameter pieces of the URLs of the harmful requests are fragmented and converted into binary character strings, the binary character strings are vectorized to form a vector matrix, and information contained in the character strings in each URL can be fully utilized.

Because logistic regression is a machine learning algorithm that learns automatically, training a logistic regression model on the URL vector matrix of multiple harmful requests allows the characteristics of harmful-request URLs to be determined. The URL to be identified is vectorized and combined with the vectors of the harmful-request URLs on which the model was trained to form a vector matrix, which is passed to the logistic regression model. Based on what the model learned from the manually labeled harmful-request URLs, the similarity between the URL to be identified and the harmful-request URLs can be judged, and hence whether the URL to be identified is a harmful request or a normal request. The URL is thus classified automatically, without manual operation.

With the method and the device, when the website server receives requests online, received URLs can be judged more effectively in real time through logistic regression, and the detection is not easily discovered and bypassed by a requester, which improves both the efficiency and the accuracy of the judgment.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flow chart illustrating a method for identifying harmful URLs based on vectorization and logistic regression according to an embodiment of the present invention;

FIG. 2 is a block diagram of an apparatus for identifying harmful URLs based on vectorization and logistic regression according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Definitions of some abbreviations and key terms to which the present invention relates

Anti-scraping system: websites output data to users, some of whom, for various reasons, use machines to simulate human web page requests. Such machine access is typically high-volume and frequent, and can adversely affect the health of the server. The anti-scraping system is a system for blocking this abnormal access. It analyzes the access log in real time, identifies scraping IPs, and maintains a database of blocked IPs.

Uniform resource locator URL: an example of a URL is as follows:

abc.com/user?u=1&cm=44

Here abc.com is the domain name, /user is the interface, and u=1 and cm=44 are the parameters in the URL.

Logistic regression: a machine learning classification algorithm that is trained on a set of labeled data and can then make classification predictions on new data.
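The decomposition of the example URL into domain name, interface and parameters can be sketched with Python's standard library. The helper name `split_url` is hypothetical, and prefixing `//` is simply a way to make `urlsplit` treat a scheme-less string as host plus path; the specification itself does not prescribe any parser.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url: str):
    """Return (domain, interface, parameters) for a scheme-less URL."""
    # A leading '//' makes urlsplit read 'abc.com' as the network location.
    parts = urlsplit("//" + url)
    return parts.netloc, parts.path, parse_qsl(parts.query)


domain, interface, params = split_url("abc.com/user?u=1&cm=44")
print(domain)     # abc.com
print(interface)  # /user
print(params)     # [('u', '1'), ('cm', '44')]
```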

As shown in fig. 1, a flowchart of a method for identifying a harmful URL based on vectorization and logistic regression according to an embodiment of the present invention is provided, where the method for identifying a harmful URL based on vectorization and logistic regression includes:

s101: extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;

s102: vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X;

s103: inputting the vector matrix X and the label vector Y into a logistic regression model for training, and obtaining a trained logistic regression model after the training is finished;

s104: when the training result h (X1) of the URL to be recognized is greater than a preset first threshold value, judging that the URL request to be recognized is a harmful request; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.
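The decision rule of step s104 amounts to a single comparison, sketched below. The function name `classify` and the default threshold of 0.5 are assumptions for illustration; the specification only requires a preset first threshold.

```python
def classify(h_x1: float, first_threshold: float = 0.5) -> str:
    """Apply the s104 rule to a model output h(X1)."""
    # Strictly greater than the threshold means harmful; otherwise harmless.
    return "harmful" if h_x1 > first_threshold else "harmless"


decisions = [classify(score) for score in (0.98, 0.01, 0.5)]
print(decisions)  # ['harmful', 'harmless', 'harmless']
```

A result exactly equal to the threshold falls on the harmless side, matching the "less than or equal to" wording of the step.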

Preferably, vectorizing each URL to obtain a corresponding multidimensional vector respectively includes:

s1021: splitting the parameters of each URL in sequence, using the symbol "&" in the URL as the split point, to obtain h parameter pieces; converting each parameter piece into a binary string, and taking the first k bits of each to form a k-dimensional vector;

arranging the k-dimensional vectors of the parameter pieces in sequence to obtain the multidimensional vector of each URL, the number of elements of which is h × k.

Preferably, converting each parameter piece into a binary string comprises:

s1021-1: traversing each parameter piece of the URL in turn with a hash function, computing the hash value of each parameter piece, taking the absolute value of the hash value, and converting the absolute value into a binary string.

Preferably, the method further comprises:

s1022: adjusting the number h × k of elements in the multidimensional vector of each URL so that the multidimensional vectors of all URLs have the same number of elements.

Preferably, the tag corresponding to the URL includes: harmful requests, harmless requests and to-be-identified;

vectorizing the tag corresponding to each URL, specifically including: and if the label corresponding to the URL is a harmless request, setting the label vectorization to be 0, if the label corresponding to the URL is a harmful request, setting the label vectorization to be 1, and if the label corresponding to the URL is to be identified, setting the label vectorization to be an unknown number y.
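The label encoding just described can be sketched as a small lookup. The label strings and the use of `None` to stand in for the unknown number y are assumptions made for this illustration.

```python
from typing import List, Optional

# Encoding taken from the text: harmful -> 1, harmless -> 0.
LABEL_CODES = {"harmful": 1, "harmless": 0}


def vectorize_label(label: str) -> Optional[int]:
    """Return 1 or 0; None represents the unknown y of a URL to be identified."""
    return LABEL_CODES.get(label)


def build_label_vector(labels: List[str]) -> List[Optional[int]]:
    """Combine the vectorized labels into one label vector."""
    return [vectorize_label(label) for label in labels]


Y1 = build_label_vector(["harmful", "harmful", "harmless", "to_be_identified"])
print(Y1)  # [1, 1, 0, None]
```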

As shown in fig. 2, a schematic structural diagram of an apparatus for identifying a harmful URL based on vectorization and logistic regression according to an embodiment of the present invention provides an apparatus for identifying a harmful URL based on vectorization and logistic regression, including:

the extraction device 21: the uniform resource locator URL extraction module is used for extracting a plurality of Uniform Resource Locators (URLs) which are determined to be harmful requests and Uniform Resource Locators (URLs) to be identified;

the vectorization unit 22: the system is used for vectorizing each URL to respectively obtain corresponding multidimensional vectors; combining multidimensional vectors of URLs of all harmful requests to obtain a vector matrix X, wherein the multidimensional vector of the URL of each harmful request is used as one row of the vector matrix X; vectorizing the label corresponding to each URL; combining the tags of all the harmful requests after URL vectorization to obtain a tag vector Y;

the training unit 23: the logistic regression model is used for inputting the vector matrix X and the vector Y into the logistic regression model for training, and after the training is finished, the trained logistic regression model is obtained; for any URL to be identified, adding the multidimensional vector of the URL to be identified into a vector matrix X as a row to form a vector matrix X1 to be identified; adding the label of the URL to be recognized after vectorization into a label vector Y to form a label vector Y1 to be recognized; inputting a vector matrix X1 to be recognized and a label vector Y1 to be recognized into a trained logistic regression model for training to obtain a training result h (X1) of the URL to be recognized; and

the judgment unit 24: the method is used for judging that the URL request to be recognized is a harmful request when the training result h (X1) of the URL to be recognized is larger than a preset first threshold value; and when the training result h (X1) of the URL to be recognized is less than or equal to a first threshold value, judging that the URL request to be recognized is a harmless request.

Preferably, the vectorization unit 22 comprises:

parameter piece conversion subunit 221: used for sequentially splitting the parameters of each URL at the "&" symbols in the URL to obtain a plurality of parameter pieces, converting each parameter piece into a binary string, and taking the first k bits of each to form a k-dimensional vector.

Preferably, the parameter piece conversion subunit 221 is specifically used for traversing each parameter piece of the URL in turn with a hash function, computing the hash value of each parameter piece, taking the absolute value of the hash value, and converting the absolute value into a binary string.

Preferably, the vectorization unit 22 further comprises:

the normalization subunit 222: used for adjusting the number h × k of elements in the multidimensional vector of each URL so that the multidimensional vectors of all URLs have the same number of elements.

the vectorization unit is specifically configured to: and if the label corresponding to the URL is a harmless request, setting the label vectorization to be 0, if the label corresponding to the URL is a harmful request, setting the label vectorization to be 1, and if the label corresponding to the URL is to be identified, setting the label vectorization to be an unknown number y.

The technical scheme of the embodiment of the invention has the following beneficial effects: the parameter pieces of the URLs of the harmful requests are fragmented and converted into binary character strings, the binary character strings are vectorized to form a vector matrix, and information contained in the character strings in each URL can be fully utilized.

Because logistic regression is a machine learning algorithm that learns automatically, training a logistic regression model on the URL vector matrix of multiple harmful requests allows the characteristics of harmful-request URLs to be determined. The URL to be identified is converted into binary strings through a hash function and vectorized, then combined with the vectors of the harmful-request URLs on which the model was trained to form a vector matrix, which is passed to the logistic regression model. Based on what the model learned from the manually labeled harmful-request URLs, the similarity between the URL to be identified and the harmful-request URLs can be judged, and hence whether the URL to be identified is a harmful request or a normal request. The URL is thus classified automatically, without manual operation.

The above technical solutions of the embodiments of the present invention are described in detail below with reference to application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.

In the prior art, although a URL is a very long string, a request is judged harmful solely by whether that long string contains one specific substring, so the full content of the URL string is not used.

The method and the device of the invention make full use of the information contained in the URL string, can effectively judge whether a request is harmful, and are not easily discovered and bypassed. A batch of request URLs is extracted and labeled, dividing them into normal requests and harmful requests, and the label corresponding to each URL is vectorized (1 represents a harmful-request URL and 0 represents a normal-request URL). The following operations are then carried out for the harmful-request URLs:

First, tokenize the URL: each URL is split at the "&" symbols it contains, yielding a set of parameter pieces L: {A1, A2, A3, A4, …, Ah}, h parameter pieces in total.

Second, define a hash function: a hash function converts a character string into a number. Identical strings always produce the same hash value, and different strings produce different hash values (apart from rare collisions).

Third, convert the parameter pieces of each URL: for each parameter-piece set L, traverse every parameter piece Ai with the hash function. First compute the hash value Hi of each Ai, then take the absolute value of Hi, then convert it into a binary string, and finally take the first k bits of the binary string to form a k-dimensional (1 × k) vector.

Fourth, concatenate the vectors: for each URL, every parameter piece Ai becomes a k-dimensional vector, and these are concatenated into a vector X of dimension h × k, where h is the number of parameter pieces in L.

Fifth, normalize the data: take a constant s as the maximum number of parameter pieces per URL after normalization. If h > s, keep only the binary strings corresponding to the first s of the h parameter pieces. If h < s, pad the front of the (h × k)-dimensional vector X up to s parameter pieces, that is, prepend (s − h) k-dimensional all-zero vectors to the existing h k-dimensional vectors. The purpose of this step is to give the concatenated vector of every URL the same dimension, s × k, so that the resulting data is regular. This completes the conversion of a URL into a vector X.

Sixth, vectorize the label corresponding to each harmful-request URL, and combine the vectorized labels to obtain the label vector.
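The first five steps can be put together in a short sketch. The specification does not name a hash function, so `stable_hash` below (a signed 64-bit integer taken from an MD5 digest) is an assumption chosen only because it gives the same value on every run; the function names and the zero-padding of binary strings shorter than k bits are likewise illustrative choices.

```python
import hashlib
from typing import List


def split_parameters(url: str) -> List[str]:
    """Step 1: cut the query string at '&' into parameter pieces A1..Ah."""
    query = url.split("?", 1)[1] if "?" in url else url
    return query.split("&")


def stable_hash(piece: str) -> int:
    """Step 2 (assumed hash): signed 64-bit integer from an MD5 digest."""
    digest = hashlib.md5(piece.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], byteorder="big", signed=True)


def piece_to_bits(piece: str, k: int) -> List[int]:
    """Step 3: |hash| -> binary string -> first k bits as a k-dimensional vector."""
    binary = bin(abs(stable_hash(piece)))[2:]
    # Assumption: strings shorter than k bits are right-padded with zeros.
    return [int(bit) for bit in binary[:k].ljust(k, "0")]


def vectorize_url(url: str, k: int = 8, s: int = 4) -> List[int]:
    """Steps 4 and 5: concatenate the piece vectors and normalize to s * k elements."""
    pieces = split_parameters(url)[:s]        # h > s: keep the first s pieces
    padding = [0] * (k * (s - len(pieces)))   # h < s: prepend (s - h) zero vectors
    bits = [bit for piece in pieces for bit in piece_to_bits(piece, k)]
    return padding + bits


short_vec = vectorize_url("abc.com/u?ntype=wifi&d=1001&u=gas")
long_vec = vectorize_url("abc.com/u?ntype=3g&d=100299&u=monk&inter=true&iv=22ddac4f&mid=122")
print(len(short_vec), len(long_vec))
```

Both URLs yield vectors of the same length s × k, which is what allows them to be stacked as rows of one matrix. Because the hash here differs from whatever function produced the numbers in the specification, the individual bits will not match the worked example.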

In practice, a batch of harmful requests is labeled manually. Each is given the label "harmful request", represented by 1 (a normal request would be given the label "normal request", represented by 0), forming the label vector Y = [1, 1, 1, 1, …] for the batch. A logistic regression model is then trained on the vectors X and the label vector Y of the batch of harmful-request URLs. Using gradient descent, the model learns the correspondence between X and Y, so that after training its output on the training data X is close to the labels Y; this yields the trained logistic regression model.

Next, the URLs to be identified (whose label, harmful or normal, is unknown) are combined with the earlier harmful-request URLs to form the vector matrix X1, and their vectorized labels are added to the label vector Y to form the label vector Y1 to be identified. X1 and Y1 are fed into the logistic regression model previously trained on the harmful-request URLs. Because the model has learned the characteristics of the data in the training matrix X, it returns a corresponding result for each URL (for example 0.98 or 0.01), from which it can be judged whether the URL to be identified is a harmful request or a normal request. Classification of the URL is thus completed automatically, without manual operation.

Therefore, when the website server receives requests online, received URLs can be classified in real time by the logistic regression model as harmful or normal requests, and a harmful request can be refused access directly.
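How gradient descent makes the model's output approach the labels can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption for illustration: the 4-element toy vectors, the two rows labeled 0, the learning rate and the epoch count. Real inputs would be the s × k-element URL vectors.

```python
import math
from typing import List, Tuple


def sigmoid(z: float) -> float:
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict(w: List[float], b: float, x: List[int]) -> float:
    """h(x): the model's estimate that x belongs to class 1 (harmful)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train(X: List[List[int]], Y: List[int],
          lr: float = 0.5, epochs: int = 500) -> Tuple[List[float], float]:
    """Batch gradient descent on the mean cross-entropy loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, y in zip(X, Y):
            error = predict(w, b, x) - y  # gradient of the loss w.r.t. the pre-sigmoid sum
            for j in range(d):
                grad_w[j] += error * x[j]
            grad_b += error
        w = [wj - lr * gj / n for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy data (hypothetical): two harmful rows (label 1) and two normal rows (label 0).
X = [[1, 0, 1, 1], [1, 0, 1, 0], [0, 1, 0, 0], [0, 1, 0, 1]]
Y = [1, 1, 0, 0]
w, b = train(X, Y)
scores = [predict(w, b, x) for x in X]
print([round(score, 2) for score in scores])
```

After training, the scores for the rows labeled 1 lie above 0.5 and those for the rows labeled 0 lie below it, which is the sense in which the output on X becomes close to Y.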

The technical solution of the present invention is detailed below by specific examples:

Suppose the URLs extracted from the access request log are as follows, where the first four are manually labeled harmful requests (each with label 1) and the 5th and 6th are URLs to be identified, i.e. it is unknown whether they are harmful or normal requests:

(1) abc.com/u?ntype=wifi&d=1001&u=gas

(2) abc.com/u?ntype=wifi&d=1001&u=gms

(3) abc.com/u?ntype=wifi&d=1001&u=gamk

(4) abc.com/u?ntype=wifi&d=1001&u=peas

(5) abc.com/u?ntype=3g&d=100299&u=monk&inter=true&iv=22ddac4f&mid=122

(6) abc.com/u?ntype=mobile&d=3282&u=onelifee&b=isc&mid=22399

For log entry (1), the URL parameters are:

ntype=wifi&d=1001&u=gas

First, the URL is tokenized; splitting at "&" gives ntype=wifi, d=1001 and u=gas.

Second, the three parameter pieces are passed through the hash function, which gives:

the hash value for ntype=wifi is: 2893893923183264473,

the hash value for d=1001 is: 399101938414587701,

the hash value for u=gas is: 1773628803862046195

Then the absolute values of the three hash values are converted into binary strings, which are, in order:

0b10100000101001001011010011000011001101000000000000001011011001

0b10110001001111001010001100100110100010111110000111100110101

0b1100010011101001100100000000000000001100110010010010111110011

Now assume that the first k = 8 bits of each binary string are taken and that the normalization constant is s = 4, so that each URL finally forms a 32-dimensional vector.

Taking the first k bits of the three parameter pieces in order gives: 10100000, 10110001, 11000100

The vector formed by concatenating these first k bits is: 101000001011000111000100

Next, the data is normalized. The URL has h = 3 parameter pieces and the normalization constant is s = 4, so one 8-dimensional zero vector 00000000 must be prepended to "101000001011000111000100" (zero-padding it to a k × s vector). The normalized vector X is: 00000000101000001011000111000100
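The arithmetic for URL (1) can be checked with a few lines of Python, starting from the three hash values quoted in the example (the hash function that produced them is not specified, so they are simply taken as given). The names `EXAMPLE_HASHES` and `first_k_bits` are hypothetical.

```python
# Hash values quoted in the worked example for the parameter pieces of URL (1).
EXAMPLE_HASHES = {
    "ntype=wifi": 2893893923183264473,
    "d=1001": 399101938414587701,
    "u=gas": 1773628803862046195,
}
K, S = 8, 4  # first k bits per piece, s pieces per URL


def first_k_bits(hash_value: int, k: int) -> str:
    """Binary form of |hash| without the '0b' prefix, cut to its first k bits."""
    return bin(abs(hash_value))[2:][:k]


segments = [first_k_bits(value, K) for value in EXAMPLE_HASHES.values()]
padding = "0" * (K * (S - len(segments)))  # one all-zero 8-bit block, since h = 3 < s = 4
vector = padding + "".join(segments)

print(segments)  # ['10100000', '10110001', '11000100']
print(vector)    # 00000000101000001011000111000100
```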

When all 6 URLs are vectorized in turn, each URL yields a 32-dimensional vector (k × s = 8 × 4 = 32):

(1)00000000101000001011000111000100

(2)00000000101000001110000111000100

(3)00000000101000001101100011000100

(4)00000000101000001111101111000100

(5)11110100010001000100010010100111

(6)00101001001001011101000100110001

The label vector Y for the first four harmful-request URLs is then [1, 1, 1, 1].
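Turning the six bit strings listed above into numeric rows might look like the following sketch. The variable names are hypothetical, and plain Python lists are used so that the example needs no third-party library; an array type would serve equally well.

```python
from typing import List

# The six 32-bit strings from the worked example, in order (1) to (6).
ROWS = [
    "00000000101000001011000111000100",
    "00000000101000001110000111000100",
    "00000000101000001101100011000100",
    "00000000101000001111101111000100",
    "11110100010001000100010010100111",
    "00101001001001011101000100110001",
]


def to_vector(bits: str) -> List[int]:
    """Convert a string of '0'/'1' characters into a list of integers."""
    return [int(ch) for ch in bits]


X = [to_vector(row) for row in ROWS[:4]]              # rows of the training matrix X
Y = [1] * len(X)                                      # all four are labeled harmful
X_to_identify = [to_vector(row) for row in ROWS[4:]]  # URLs (5) and (6)

print(len(X), len(X[0]), Y)
```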

The logistic regression model is trained on the first 4 URLs, which are labeled as harmful requests; it can be seen that these 4 URLs are relatively similar. At this point a batch of URL data can be labeled manually, with harmful-request URLs labeled 1 and normal-request URLs labeled 0. The harmful URL requests are extracted and, as in the foregoing steps, the string of each harmful-request URL is vectorized into a 32-dimensional vector; the multidimensional vectors of all harmful-request URLs are combined into the vector matrix X, with the multidimensional vector of each URL forming one row of X. The label vector Y corresponding to the harmful-request URLs is formed as well, and the logistic regression model is then trained.

The logistic regression model can be implemented with tensorflow.keras, which makes logistic regression training simple and convenient: the code is short, and a model can be trained in just a few lines. The training data of the 4 harmful requests is written into the vector matrix X and the corresponding labels into Y, and the model is then written in TensorFlow:

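import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# A single sigmoid unit over the 32-dimensional URL vector is a logistic regression.
model = Sequential()
model.add(Dense(1, activation=tf.nn.sigmoid, input_shape=(32,)))
# binary_crossentropy expects probabilities, matching the sigmoid output above.
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# X, Y are the training matrix and labels; Xtest, Ytest are held-out validation data.
model.fit(x=X, y=Y, batch_size=64, epochs=10, validation_data=(Xtest, Ytest))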

Once written, the logistic regression model is trained; after training is finished, the result for the URLs is returned by:

model.predict(X_topredict)

Next, the multidimensional vectors of the 5th and 6th URLs are combined with those of the 1st to 4th URLs to form the vector matrix X1, and the vectorized labels of the URLs to be identified are added to the label vector Y to form the label vector Y1 to be identified. X1 and Y1 are passed to the logistic regression model via model.predict(X_topredict). Because the model has learned the characteristics of the data in the training matrix X, model.predict(X_topredict) returns the results h(X1) for the 5th and 6th URLs, by which they can be classified: when the result h(X1) of a URL to be identified is greater than the first threshold, the request is judged harmful; when it is less than or equal to the first threshold, the request is judged harmless. In this way it can be determined whether the 5th and 6th URLs are harmful requests. Here X_topredict is the data to be predicted, which must likewise go through URL vectorization and be written as a matrix (the multidimensional vector of each URL to be identified is added as a row to the vector matrix X to form the vector matrix X1 to be identified, and the vectorized label of each URL to be identified is added to the label vector Y to form the label vector Y1 to be identified).

The parameter pieces of the URLs of the harmful requests are fragmented and converted into binary character strings, the binary character strings are vectorized to form a vector matrix, and information contained in the character strings in each URL can be fully utilized.

It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.

In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".

Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.

The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.

In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium, and, thus, is included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., infrared, radio, and microwave. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for identifying harmful URLs based on vectorization and logistic regression, comprising:

2. The method for identifying harmful URLs based on vectorization and logistic regression as claimed in claim 1, wherein vectorizing each URL to obtain a corresponding multidimensional vector respectively comprises:

sequentially partitioning the parameters of each URL by using the symbol "&" in the URL as a partitioning point to obtain h parameter pieces; converting each parameter piece into a binary character string, and taking the first k bits of each parameter piece to form a k-dimensional vector;

3. The method of identifying harmful URLs based on vectorization and logistic regression according to claim 2, wherein converting each parameter piece into a binary string comprises:

and traversing each parameter piece of the URL in sequence through a Hash function Hash, calculating to obtain a Hash value of each parameter piece, taking an absolute value of the Hash value, and converting the absolute value into a binary string.

4. The method of identifying harmful URLs based on vectorization and logistic regression as recited in claim 2, further comprising:

and adjusting the number h x k of elements in the multidimensional vector of each URL so that the number of elements in the multidimensional vector of all URLs is the same.

5. The method of claim 1 for identifying harmful URLs based on vectorization and logistic regression, wherein the tags corresponding to the URLs comprise: harmful requests, harmless requests and to-be-identified;

6. An apparatus for identifying harmful URLs based on vectorization and logistic regression, comprising:

a training unit, configured to: input the vector matrix X and the label vector Y into the logistic regression model for training, and obtain the trained logistic regression model after the training is finished; for any URL to be identified, add the multidimensional vector of the URL to be identified to the vector matrix X as a row to form a vector matrix X1 to be identified; vectorize the label of the URL to be identified and add it to the label vector Y to form a label vector Y1 to be identified; and input the vector matrix X1 to be identified and the label vector Y1 to be identified into the trained logistic regression model for training to obtain a training result h(X1) of the URL to be identified;

7. The apparatus for identifying harmful URLs based on vectorization and logistic regression according to claim 6, wherein the vectorization unit includes:

a parameter piece conversion subunit, configured to: sequentially partition the parameters of each URL by using the symbol "&" in the URL as a partitioning point to obtain a plurality of parameter pieces; convert each parameter piece into a binary character string; and take the first k bits of each parameter piece to form a k-dimensional vector;

8. The apparatus for identifying harmful URLs based on vectorization and logistic regression as recited in claim 7,

wherein the parameter piece conversion subunit is specifically configured to: sequentially traverse each parameter piece of the URL through a Hash function Hash, calculate a Hash value of each parameter piece, take the absolute value of the Hash value, and convert the absolute value into a binary string.

9. The apparatus for identifying harmful URLs based on vectorization and logistic regression as recited in claim 7, wherein the vectorization unit further comprises:

an adjusting subunit, configured to adjust the number h x k of elements in the multidimensional vector of each URL so that the number of elements in the multidimensional vectors of all URLs is the same.

10. The apparatus for identifying harmful URLs based on vectorization and logistic regression as recited in claim 6, wherein the tags corresponding to the URLs comprise: harmful requests, harmless requests and to-be-identified;