CN111177596A

CN111177596A - URL (Uniform Resource Locator) request classification method and device based on an LSTM (long short-term memory) model

Info

Publication number: CN111177596A
Application number: CN201911353630.6A
Authority: CN
Inventors: 王嘉伟
Original assignee: Weimeng Chuangke Network Technology China Co Ltd
Current assignee: Weimeng Chuangke Network Technology China Co Ltd
Priority date: 2019-12-25
Filing date: 2019-12-25
Publication date: 2020-05-19
Anticipated expiration: 2039-12-25
Also published as: CN111177596B

Abstract

The embodiments of the invention provide a method and a device for classifying URL (Uniform Resource Locator) requests based on an LSTM (long short-term memory) model. Each URL is vectorized to obtain a corresponding vector matrix; the vector matrices of all requests are combined to obtain a three-dimensional matrix X; the label corresponding to each URL is vectorized; the vectorized labels of all requests are combined to obtain a label vector Y; the three-dimensional matrix X and the label vector Y are input into an LSTM long short-term memory network model for training, and a trained LSTM model is obtained when training is finished; a three-dimensional matrix X1 to be identified and a label vector Y1 to be identified are input into the trained LSTM model for training to obtain a training result h(X1) of the URL to be identified; when the training result h(X1) of the URL to be identified is greater than a preset first threshold, the URL request to be identified is a harmful request. A newly received URL request to be identified is thus judged to be a harmful request or a harmless request according to what the LSTM model has learned from training on harmful URLs.

Description

URL (Uniform Resource Locator) request classification method and device based on an LSTM (long short-term memory) model

Technical Field

The invention relates to the field of computers, in particular to a URL request classification method and device based on an LSTM model.

Background

A website provides data to its users, but some users access the website frequently by illegitimate means, so that the website becomes congested or goes down. An anti-scraping system therefore collects and blocks these abnormally accessing users.

In the process of implementing the invention, the applicant finds that at least the following problems exist in the prior art:

the conventional method of identifying an illegitimate user by IP address is difficult to apply: because such users change IP addresses frequently, each IP address is used only a limited number of times, which makes it hard to judge whether a user is illegitimate.

Disclosure of Invention

The embodiments of the invention provide a URL request classification method and device based on an LSTM model, which judge whether a newly received URL request to be identified is a harmful request or a harmless request according to what the LSTM long short-term memory network model has learned from training on harmful URLs.

To achieve the above object, in one aspect, an embodiment of the present invention provides a method for classifying URL requests based on an LSTM model, including:

obtaining a plurality of URLs which are determined to be harmful requests and a plurality of URLs which are determined to be harmless requests;

vectorizing each URL to respectively obtain corresponding vector matrixes; combining vector matrixes of all harmful requests and harmless requests to obtain a three-dimensional matrix X, wherein the vector matrix of each URL in the three-dimensional matrix X is arranged in parallel;

vectorizing the label corresponding to each URL; combining all harmful requests and tags subjected to URL vectorization of harmless requests to obtain a tag vector Y;

inputting the three-dimensional matrix X and the label vector Y into a long-short term memory network LSTM model for training, and obtaining a trained LSTM model after the training is finished;

extracting any URL request to be identified, and adding a vector matrix of the URL request to be identified into a three-dimensional matrix X to form a three-dimensional matrix X1 to be identified; adding the label of the URL request to be identified after vectorization into a label vector Y to form a label vector Y1 to be identified; inputting a three-dimensional matrix X1 to be recognized and a label vector Y1 to be recognized into a well-trained LSTM model for training to obtain a training result h (X1) of the URL request to be recognized; and

when the training result h (X1) of the URL request to be recognized is greater than a preset first threshold value, judging the URL request to be recognized as a harmful request; and when the training result h (X1) of the URL request to be recognized is less than or equal to a first threshold value, judging the URL request to be recognized as a harmless request.

On the other hand, an embodiment of the present invention provides an apparatus for classifying URL requests based on an LSTM model, including:

an acquisition unit, configured to obtain a plurality of URLs determined to be harmful requests and a plurality of URLs determined to be harmless requests;

a vectorization unit, configured to vectorize each URL to obtain a corresponding vector matrix; combine the vector matrices of all harmful and harmless requests to obtain a three-dimensional matrix X, wherein the vector matrices of the URLs are arranged in parallel in the three-dimensional matrix X; vectorize the label corresponding to each URL; and combine the vectorized labels of all harmful and harmless requests to obtain a label vector Y;

a model generation unit, configured to input the three-dimensional matrix X and the label vector Y into the long short-term memory network LSTM model for training, and obtain a trained LSTM model after the training is finished;

a deep training unit, configured to extract any URL request to be identified, and add the vector matrix of the URL request to be identified to the three-dimensional matrix X to form a three-dimensional matrix X1 to be identified; add the vectorized label of the URL request to be identified to the label vector Y to form a label vector Y1 to be identified; and input the three-dimensional matrix X1 and the label vector Y1 into the trained LSTM model for training to obtain a training result h(X1) of the URL request to be identified;

a determination unit, configured to judge that the URL request to be identified is a harmful request when its training result h(X1) is greater than a preset first threshold, and that it is a harmless request when h(X1) is less than or equal to the first threshold.

The technical scheme has the following beneficial effects: when the website server receives the request on line, the received URL can be judged in real time through the LSTM long-short term memory network model, and the information of the URL is fully utilized, so that the URL is not easy to be discovered and bypassed by a requester, and the judgment efficiency and the judgment accuracy are improved. Because the LSTM long-short term memory network model is a deep learning model, the LSTM long-short term memory network model can automatically learn, and has good fitting effect on data with sequence, namely, the information of the parameter piece (key) and the relative sequence information of the parameter piece (key sequence) can be utilized. Therefore, the URLs of a plurality of harmful requests are vectorized to form a three-dimensional matrix, and the three-dimensional matrix is trained by using an LSTM long-short term memory network model, so that the characteristics of the URLs of the harmful requests can be summarized.

Therefore, the URL to be identified is vectorized and forms a three-dimensional matrix together with the trained harmful request URLs, and this matrix is passed to the LSTM long short-term memory network model. Based on what the model learned from the manually labeled harmful request URLs, the similarity between the URL to be identified and the harmful request URLs (that is, the similarity between the keys of their parameter pieces) can be judged, and from it whether the URL to be identified is a harmful request or a harmless request; the URL is thus classified automatically without manual operation. Vectorizing the URLs of a plurality of harmful requests into a three-dimensional matrix makes full use of the effective information contained in the character strings of each URL. In particular, when slight differences between versions change the parameter pieces and shift the vectors, the LSTM model can identify the relative order information (key sequence) of the parameter pieces, so that keys of the same parameter pieces arranged at shifted positions can still be recognized; this improves the discrimination of similar but different URLs and raises the probability of identifying harmful request URLs.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a flowchart illustrating a URL request classification method based on an LSTM model according to an embodiment of the present invention;

FIG. 2 is a schematic structural diagram of a URL request classification device based on the LSTM model according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Definitions of some abbreviations and key terms to which the present invention relates

Anti-scraping system: websites output data to users, and some users, for various reasons, use machines to simulate human web page access requests. Because such access is performed by machines, it is usually large in volume and frequent, which adversely affects the health of the server. The anti-scraping system is a system for blocking this abnormal access: it analyzes the real-time access log, identifies the scraping IP addresses, and blocks them.

Example of a uniform resource locator URL:

abc.com/user?u=1&cm=44

abc.com is the domain name, /user is the interface, and u=1 and cm=44 are the parameters; each parameter consists of a key and a value that form a key-value pair, e.g., for u=1 the key is u and the value is 1.
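As an illustration of the URL anatomy described above, the following minimal Python sketch splits the example URL into domain, interface, and ordered key-value pairs. The helper name `split_url` is hypothetical and is not part of the described method; the sketch assumes a URL written without a scheme, as in the example.

```python
from urllib.parse import urlsplit, parse_qsl

def split_url(url):
    """Split a scheme-less URL into (domain, interface, ordered key-value pairs)."""
    # Prefixing "//" makes urlsplit treat the leading part as the host name.
    parts = urlsplit("//" + url)
    # parse_qsl keeps the parameters in the order in which they appear.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return parts.netloc, parts.path, params

domain, interface, params = split_url("abc.com/user?u=1&cm=44")
keys = [key for key, _ in params]
print(domain, interface, params, keys)
```

Only the keys are used by the vectorization steps described later; the values are discarded.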

LSTM: the long short-term memory network model is a deep learning model; it can be trained on labeled data and then used to classify unlabeled data, and it can process sequential data.

FIG. 1 is a schematic flowchart of a URL request classification method based on an LSTM model according to an embodiment of the present invention. As shown in FIG. 1, the method includes:

s101: obtaining a plurality of URLs which are determined to be harmful requests and a plurality of URLs which are determined to be harmless requests;

s102: vectorizing each URL to respectively obtain corresponding vector matrixes; combining vector matrixes of all harmful requests and harmless requests to obtain a three-dimensional matrix X, wherein the vector matrix of each URL in the three-dimensional matrix X is arranged in parallel;

s103: inputting the three-dimensional matrix X and the label vector Y into a long-short term memory network LSTM model for training, and obtaining a trained LSTM model after the training is finished;

s104: extracting any URL request to be identified, and adding a vector matrix of the URL request to be identified into a three-dimensional matrix X to form a three-dimensional matrix X1 to be identified; adding the label of the URL request to be identified after vectorization into a label vector Y to form a label vector Y1 to be identified; inputting a three-dimensional matrix X1 to be recognized and a label vector Y1 to be recognized into a well-trained LSTM model for training to obtain a training result h (X1) of the URL request to be recognized;

s105: when the training result h (X1) of the URL request to be recognized is greater than a preset first threshold value, judging the URL request to be recognized as a harmful request; and when the training result h (X1) of the URL request to be recognized is less than or equal to a first threshold value, judging the URL request to be recognized as a harmless request.
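The decision rule of step S105 can be sketched as follows. The function name `classify` and the default threshold of 0.5 are assumptions made for illustration; the text only states that a preset first threshold exists.

```python
def classify(score, first_threshold=0.5):
    """Apply the S105 rule: harmful if h(X1) > threshold, otherwise harmless.

    The value 0.5 is an assumed default, not a value given in the text.
    """
    return "harmful" if score > first_threshold else "harmless"

print(classify(0.98))  # score above the threshold
print(classify(0.01))  # score below the threshold
print(classify(0.5))   # score equal to the threshold counts as harmless
```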

Preferably, vectorizing each URL to obtain a corresponding vector matrix respectively, including S1021:

sequentially partitioning the parameters of each URL by using the symbol "&" in the URL as a partitioning point to obtain h parameter pieces;

acquiring a key of each parameter piece, converting the key of each parameter piece into a binary character string, and taking the first k bits of the binary character string of each key to form a k-dimensional vector;

and arranging the k-dimensional vectors of the keys as columns in the segmentation order to form a k × h vector matrix, wherein k is the number of rows of the vector matrix and h is the number of columns.

Preferably, converting the key of each parameter piece into a binary string comprises:

s1021-1: and traversing keys of each parameter piece of the URL in sequence through a Hash function, calculating to obtain a Hash value of each key, taking an absolute value of the Hash value, and converting the absolute value into a binary character string.
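A minimal sketch of the key-to-vector conversion in S1021 and S1021-1 follows. The text does not name a hash function, so BLAKE2b truncated to a signed 64-bit integer is used here purely as a stand-in (Python's built-in `hash()` is salted per process for strings and would not be reproducible). The helper name `key_to_bits` and the left-padding of very short binary strings are likewise assumptions.

```python
import hashlib

def key_to_bits(key, k=8):
    """Hash a parameter key, take the absolute value, and keep the first k binary digits."""
    # Assumed stand-in hash: 8-byte BLAKE2b digest read as a signed 64-bit integer.
    digest = hashlib.blake2b(key.encode("utf-8"), digest_size=8).digest()
    value = int.from_bytes(digest, byteorder="big", signed=True)
    # bin() yields e.g. "0b1011"; strip the prefix after taking the absolute value.
    binary = bin(abs(value))[2:]
    # Assumption: pad on the left with zeros in the rare case of fewer than k digits.
    return [int(bit) for bit in binary.zfill(k)[:k]]

print(key_to_bits("ntype"))
print(key_to_bits("mid", k=16))
```

The same key always maps to the same k-dimensional vector, which is the property the method relies on.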

Preferably, vectorizing each URL to obtain a corresponding vector matrix, respectively, further includes S1022:

the number of columns of the vector matrix is adjusted so that the vector matrix for each URL has the same number of columns, s, specifically,

when h is larger than s, taking k-dimensional vectors of the first s parameter pieces;

when h is less than s, b columns of k-dimensional 0 vectors are added before the first column, wherein b = s - h.
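The column adjustment of S1022 can be sketched in plain Python as below. The matrix is represented as a list of k rows, and the helper name `regularize_columns` is hypothetical.

```python
def regularize_columns(matrix, s):
    """Return a copy of a k x h matrix with exactly s columns.

    If h > s, only the first s columns are kept; if h < s, (s - h) all-zero
    columns are inserted before the first column.
    """
    h = len(matrix[0]) if matrix else 0
    if h >= s:
        return [row[:s] for row in matrix]
    pad = [0] * (s - h)
    return [pad + row for row in matrix]

print(regularize_columns([[1, 0, 1], [0, 1, 1]], 2))  # truncation
print(regularize_columns([[1, 0], [0, 1]], 4))        # zero padding at the front
```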

Preferably, the tag corresponding to the URL includes: harmful requests, harmless requests and to-be-identified;

vectorizing the tag corresponding to each URL, specifically including: and if the label corresponding to the URL is a harmless request, setting the label vectorization to be 0, if the label corresponding to the URL is a harmful request, setting the label vectorization to be 1, and if the label corresponding to the URL is to be identified, setting the label vectorization to be an unknown number y.

FIG. 2 is a schematic structural diagram of a device for classifying URL requests based on an LSTM model according to an embodiment of the present invention. As shown in FIG. 2, the device includes:

the acquisition unit 21, configured to obtain a plurality of URLs determined to be harmful requests and a plurality of URLs determined to be harmless requests;

the vectorization unit 22, configured to vectorize each URL to obtain a corresponding vector matrix; combine the vector matrices of all harmful and harmless requests to obtain a three-dimensional matrix X, wherein the vector matrices of the URLs are arranged in parallel in the three-dimensional matrix X; vectorize the label corresponding to each URL; and combine the vectorized labels of all harmful and harmless requests to obtain a label vector Y;

the model generation unit 23, configured to input the three-dimensional matrix X and the label vector Y into the long short-term memory network LSTM model for training, and obtain a trained LSTM model after the training is finished;

the deep training unit 24, configured to extract any URL request to be identified, and add the vector matrix of the URL request to be identified to the three-dimensional matrix X to form a three-dimensional matrix X1 to be identified; add the vectorized label of the URL request to be identified to the label vector Y to form a label vector Y1 to be identified; and input the three-dimensional matrix X1 and the label vector Y1 into the trained LSTM model for training to obtain a training result h(X1) of the URL request to be identified;

the determination unit 25, configured to judge that the URL request to be identified is a harmful request when its training result h(X1) is greater than a preset first threshold, and that it is a harmless request when h(X1) is less than or equal to the first threshold.

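Preferably, the vectorization unit 22 includes a parameter piece segmentation and conversion subunit 221, and the parameter piece segmentation and conversion subunit 221 is configured to:

sequentially separate the parameters of each URL by using the symbol "&" in the URL as a separation point to obtain h parameter pieces;

obtain the key of each parameter piece, convert the key of each parameter piece into a binary character string, and take the first k bits of the binary character string of each key to form a k-dimensional vector;

arrange the k-dimensional vectors of the keys as columns in the segmentation order to form a k × h vector matrix, wherein k is the number of rows of the vector matrix and h is the number of columns.

Preferably, the parameter piece segmentation and conversion subunit 221 is specifically configured to:

traverse the keys of the parameter pieces of the URL in sequence with a hash function, calculate the hash value of each key, take the absolute value of the hash value, and convert the absolute value into a binary character string.

Preferably, the vectorization unit 22 further includes a regularization subunit 222, and the regularization subunit 222 is specifically configured to:

adjust the number of columns of the vector matrix so that the vector matrix of each URL has the same number of columns s; specifically, when h is larger than s, take the k-dimensional vectors of the first s parameter pieces, and when h is less than s, add b columns of k-dimensional 0 vectors before the first column, wherein b = s - h.

Preferably, the labels corresponding to the URLs include: harmful request, harmless request, and to be identified; the vectorization unit 22 is specifically configured to: set the vectorized label to 0 if the label corresponding to the URL is a harmless request, to 1 if the label corresponding to the URL is a harmful request, and to an unknown number y if the label corresponding to the URL is to be identified.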

The technical solution of the embodiments of the invention has the following beneficial effects: when the website server receives requests online, each received URL can be judged in real time by the LSTM long short-term memory network model, and the information in the keys of the URL parameter pieces is fully used, so the judgment is not easily discovered and bypassed by a requester, and both the efficiency and the accuracy of the judgment are improved. Because the LSTM model is a deep learning model, it learns automatically and fits sequential data well; that is, both the information of the parameter pieces (keys) and their relative order (key sequence) can be used. The keys of the URL parameter pieces of harmful requests are therefore converted into binary character strings by a hash function, the binary strings are vectorized to form a three-dimensional matrix, and the LSTM model is trained on that matrix, so that the characteristics of harmful request URLs can be summarized.
The keys of the parameter pieces of the URL to be identified are likewise converted into binary strings by the hash function and vectorized, forming a three-dimensional matrix together with the trained harmful request URLs, and this matrix is passed to the LSTM model. Based on what the model learned from the manually labeled harmful and harmless request URLs during training, a result is obtained from which the similarity between the URL to be identified and the harmful request URLs is judged; the similarity is high when keys of the URL to be identified resemble keys of a harmful request URL and the two share several such pairs of similar keys. Whether the URL request to be identified is a harmful request or a harmless request is thereby judged, and the URL is classified automatically without manual operation.

The keys of the parameter pieces of a plurality of harmful request URLs are converted into binary character strings, which are vectorized to form a three-dimensional matrix, so that the effective information contained in the character strings of each URL is fully used. In particular, when URLs differ between versions (for example, version 4.0.0 of the microblog client may send URL 1 while version 8.0.0 sends URL 2, and their parameters may be the same or different), the LSTM model can identify the relative order information (key sequence) of the parameter pieces, so that keys of the same parameter pieces arranged at shifted positions can still be recognized; this improves the discrimination of similar but different URLs and raises the probability of identifying harmful request URLs.

The above technical solutions of the embodiments of the present invention are described in detail below with reference to application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.

The invention relates to a URL request classification method and a device based on an LSTM model, which judge whether the URL of an access request is a harmful URL or not by utilizing information contained in a character string of the URL.

In implementation, the URLs of a batch of requests are extracted and labeled as harmless requests or harmful requests. The labels corresponding to all URLs are vectorized, where 1 represents a harmful request URL and 0 represents a harmless (normal) request URL. The vectorized labels of the requests are combined to obtain the label vector Y corresponding to the batch of request URLs, with shape (n, 1), where n indicates that the vector Y has n rows; for example, Y = [1, 1, 1, 1, …, 0, 0].
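The construction of the label vector Y can be sketched as follows. The names `LABEL_VALUES` and `build_label_vector` are hypothetical, and the sketch covers only labeled URLs (the "to be identified" label with unknown value y is not handled here).

```python
# Mapping stated in the text: harmless -> 0, harmful -> 1.
LABEL_VALUES = {"harmless": 0, "harmful": 1}

def build_label_vector(labels):
    """Turn n text labels into a label vector of shape (n, 1)."""
    return [[LABEL_VALUES[label]] for label in labels]

# Four harmful requests followed by two harmless ones.
Y = build_label_vector(["harmful"] * 4 + ["harmless"] * 2)
print(Y, (len(Y), len(Y[0])))
```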

The process of vectorizing the URL for each request is as follows:

The URL is first segmented: the parameters of each URL are separated in sequence by using the symbol "&" in the URL as a separation point, giving h parameter pieces that form a parameter piece set {A1, A2, A3, A4, …, Ah} for each URL. The keys of all parameter pieces are extracted to form a key set L: {K1, K2, K3, …, Kh}. Since hashing is a method for converting a character string into a number, a hash function is defined so that, after the keys of the parameter pieces are converted by it, identical character strings have the same hash value and different character strings have different hash values. The hash function traverses each key Ki: first the hash value Hi of each Ki is calculated, then the absolute value of Hi is taken and converted into a binary character string, and finally the first k bits of the binary string are taken to form a k-dimensional vector. Taking only k bits reduces computation and time cost while still allowing training to produce a usable judgment result.

The k-dimensional vectors of each key are arranged in the dividing order as columns to form a vector matrix V of k x h, wherein h represents the number of parameter pieces.

The number of columns of the vector matrix V is adjusted so that the vector matrix V of every URL has the same number of columns. That is, the data are regularized: every matrix has the same number of rows k and the same number of columns, so that the resulting three-dimensional matrix X is regular and has no vacant entries. A constant s is taken as the maximum number of parameter-piece keys kept for each URL after regularization. For each URL, if h > s, the binary strings corresponding to the keys Ki of the first s of the h parameter pieces are taken; if h < s, the k × h vector matrix V is padded at the front up to s keys, that is, (s - h) columns are added, all of them 0 vectors, placed before the first column of the existing vector matrix V. The purpose of this step is that the matrices formed from all URLs can be stacked without gaps, giving a complete matrix.

All vector matrices V of the harmless and harmful requests are combined to obtain a three-dimensional matrix X of shape (n, k, s), where n is the number of requested URLs in the training data set and the vector matrices V of the requested URLs are arranged in parallel in X; in other words, the training data (URLs) are written as a three-dimensional matrix X.
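Stacking the per-URL matrices into X can be sketched with nested lists as below. The helper name `stack_matrices` is hypothetical; it also returns the shape so that the (n, k, s) layout can be checked, and it rejects matrices that were not regularized to a common shape.

```python
def stack_matrices(matrices):
    """Stack n matrices of identical shape (k, s) into a nested list of shape (n, k, s)."""
    if not matrices:
        return [], (0, 0, 0)
    k, s = len(matrices[0]), len(matrices[0][0])
    for m in matrices:
        # Every URL matrix must already have been regularized to k rows and s columns.
        if len(m) != k or any(len(row) != s for row in m):
            raise ValueError("every URL matrix must have shape (k, s)")
    stacked = [[list(row) for row in m] for m in matrices]
    return stacked, (len(matrices), k, s)

V1 = [[1, 0, 1], [0, 1, 1]]
V2 = [[0, 0, 1], [1, 1, 0]]
X, shape = stack_matrices([V1, V2])
print(shape)
```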

The LSTM long short-term memory network model is trained with the three-dimensional matrix X and the label vector Y corresponding to the requested URLs. Through gradient descent the LSTM model learns the correspondence between X and Y, so that after training the output of the model on X is close to the labels Y (that is, the training result can indicate whether a URL is a harmful request or a harmless request); the trained LSTM model is obtained when training is finished. Then the URL to be identified (whose label, harmful or harmless, is unknown) and the previously trained request URLs together form a three-dimensional matrix X1, and the vectorized label of the URL to be identified is added to the label vector Y to form a label vector Y1 to be identified. The three-dimensional matrix X1 and the label vector Y1 are fed into the LSTM model previously trained on the harmful and harmless request URLs. Relying on the features it has learned, the model produces a corresponding training result, and the returned result is the judgment result (for example, the training results of one URL to be identified are 0.98 and 0.01, those of another are 0.6 and 0.7, and so on). Whether the URL to be identified is a harmful request or a harmless request can be judged from this result, so the URL classification is completed automatically without manual operation.
Therefore, when the website server receives requests online, each received URL can be classified in real time by the LSTM model as a harmful request or a harmless request, and if the request is harmful, the website server can directly refuse access.

The technical solution of the present invention is detailed below by specific examples:

Now assume that the following URLs are extracted from the access request log, where the first four are manually labeled harmful requests (label value 1 each) and the 5th and 6th are harmless requests:

(1) abc.com/u?ntype=wifi&d=1001&u=gas&mid=3381

(2) abc.com/u?ntype=wifi&d=1001&u=gms&mid=3381

(3) abc.com/u?ntype=wifi&d=1001&u=gamk&mid=3381

(4) abc.com/u?ntype=wifi&d=1001&u=peas&mid=3381

(5) abc.com/u?ntype=3g&d=100299&u=monk&inter=true&iv=22ddac4f&mid=122

(6) abc.com/u?ntype=mobile&d=3282&u=onelifee&b=isc&mid=22399

For log entry (1), the parameter string of the URL is: ntype=wifi&d=1001&u=gas&mid=3381;

The URL is segmented into parameter pieces: ntype=wifi, d=1001, u=gas, mid=3381. The keys of the parameter pieces are, respectively: ntype, d, u, mid; the values corresponding to the keys are: wifi, 1001, gas, 3381.
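As an illustration, the key extraction just described can be applied to all six example URLs. The helper name `extract_keys` is hypothetical; it splits on "?" and "&" as the text describes and keeps the keys in their original order.

```python
urls = [
    "abc.com/u?ntype=wifi&d=1001&u=gas&mid=3381",
    "abc.com/u?ntype=wifi&d=1001&u=gms&mid=3381",
    "abc.com/u?ntype=wifi&d=1001&u=gamk&mid=3381",
    "abc.com/u?ntype=wifi&d=1001&u=peas&mid=3381",
    "abc.com/u?ntype=3g&d=100299&u=monk&inter=true&iv=22ddac4f&mid=122",
    "abc.com/u?ntype=mobile&d=3282&u=onelifee&b=isc&mid=22399",
]

def extract_keys(url):
    """Return the parameter keys of a URL in their original order."""
    _, _, query = url.partition("?")
    pieces = query.split("&") if query else []
    # Each parameter piece has the form key=value; keep only the key.
    return [piece.split("=", 1)[0] for piece in pieces]

key_sequences = [extract_keys(url) for url in urls]
for sequence in key_sequences:
    print(len(sequence), sequence)
```

The four harmful URLs share the key sequence ntype, d, u, mid and differ only in their values, while the two harmless URLs carry additional keys (h = 6 and h = 5), so their matrices must be brought to the common column count s before stacking.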

The keys of the parameter pieces are then hashed. After conversion by the hash function, the keys become:

the hash value corresponding to ntype is: 6244237488287224263, d corresponds to a hash value of: 3002026730155008313, u corresponds to a hash value of: 7716079414827596089, mid corresponds to a hash value of: 3517718010798098266.

then, the four hash values are converted into binary character strings, which are sequentially:

0b101011010100111111111001110100110000110100010011001000111000111，

0b10100110101001010101110110011111011001000111011110000100111001，

-0b110101100010101000001010101100010100011100000001011100100111001，

0b11000011010001011100011110010010100101100111010100011101011010。

Taking the absolute values gives:

0b101011010100111111111001110100110000110100010011001000111000111，

0b10100110101001010101110110011111011001000111011110000100111001，

0b110101100010101000001010101100010100011100000001011100100111001，

0b11000011010001011100011110010010100101100111010100011101011010。

Now assume that k = 8, so that the first 8 bits of each binary string are taken: 10101101, 10100110, 11010110, 11000011.

These form a vector matrix V of size (8, 4), with one column per key:

1 1 1 1
0 0 1 1
1 1 0 0
0 0 1 0
1 0 0 0
1 1 1 0
0 1 1 1
1 0 0 1
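The worked example above can be reproduced from the four hash values quoted in the text. The sketch below takes those values as given (it does not recompute them, since the hash function itself is not specified) and rebuilds the 8 × 4 matrix with one column per key.

```python
# Hash values as quoted in the text for the keys ntype, d, u, mid.
hashes = {
    "ntype": 6244237488287224263,
    "d": 3002026730155008313,
    "u": 7716079414827596089,
    "mid": 3517718010798098266,
}
k = 8

# Absolute value -> binary string without the "0b" prefix -> first k digits.
columns = [bin(abs(value))[2:][:k] for value in hashes.values()]

# Arrange the k-bit vectors as columns: row i collects digit i of every key.
V = [[int(column[i]) for column in columns] for i in range(k)]
rows = ["".join(str(bit) for bit in row) for row in V]
print(columns)
print("\n".join(rows))
```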

When the regularization constant s is 4, no columns need to be added or removed, so V is unchanged:

1 1 1 1
0 0 1 1
1 1 0 0
0 0 1 0
1 0 0 0
1 1 1 0
0 1 1 1
1 0 0 1

All 6 URLs are vectorized in sequence, and a vector matrix V formed by each URL is as follows:

1 st URL:

1 1 1 1
0 0 1 1
1 1 0 0
0 0 1 0
1 0 0 0
1 1 1 0
0 1 1 1
1 0 0 1

The other 5 URLs are vectorized in the same way as the first, each forming an 8 × 4 vector matrix V, and the vector matrices of all 6 requested URLs are combined to obtain a three-dimensional matrix X of shape (n, k, s) = (6, 8, 4), in which the vector matrix V of each request URL is arranged in parallel. The label vector Y composed of the six URLs is [1, 1, 1, 1, 0, 0].

Then the 6 labeled request URLs are used to train the LSTM long short-term memory network model; URLs 1 to 4 are similar to one another, and URLs 5 and 6 are similar to each other. The LSTM deep learning model can be trained with Keras. The specific operations for training on the 6 request URLs are as follows:

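from tensorflow.keras.models import Sequential

from tensorflow.keras.layers import LSTM, Dense

model = Sequential()

model.add(LSTM(32, input_shape=(8, 4)))  # each URL is a (k, s) = (8, 4) matrix

model.add(Dense(1, activation='sigmoid'))  # single output h(X) in (0, 1)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])  # the sigmoid output pairs with a probability-based loss

model.fit(x=X, y=Y, batch_size=64, epochs=10, validation_data=(Xtest, Ytest))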

Once the model has been written, it is trained. After training is completed, predictions are obtained with:

Ypredict = model.predict(X_topredict)

At this point, a batch of URL data can be labeled manually: the label of each harmful request URL is set to 1, and the label of each harmless request URL is set to 0. The harmful and harmless request URLs are extracted, and the keys of the parameter pieces in each request URL are converted into binary strings and vectorized according to the steps above.

Next, the set of URL matrices X_topredict (each of size (k, s)) whose harmfulness needs to be determined can be fed to the model to predict whether each is a harmful request. The URLs in X_topredict must also be converted to matrices first. Once this step is completed, Ypredict has shape (n, 1) and is the harmfulness prediction of the LSTM deep learning model for X_topredict.

Then, the vector matrix V of the URL to be recognized is combined with those of the 1st to 6th URLs to form a three-dimensional matrix X1 to be recognized, and the label of the URL to be recognized is vectorized and added to the label vector Y to form a label vector Y1 to be recognized. The three-dimensional matrix X1 and the label vector Y1 are fed into the LSTM long short-term memory network model via model.predict(X_topredict). Because the LSTM model has deep-learned the features of the data in the previously trained three-dimensional matrix X, and its output has the character of a logistic regression, the training result h(X1) of the URL to be recognized is obtained from the prediction result Ypredict. The URL can then be classified according to the returned result h(X1): when the training result h(X1) of the URL to be recognized is greater than a first threshold, the URL request to be recognized is judged to be a harmful request; when h(X1) is less than or equal to the first threshold, it is judged to be a harmless request. In this way, each URL to be recognized can be determined to be a harmful or a harmless request. X_topredict must likewise be vectorized and written in matrix form before prediction (the vector matrix V of the URL to be recognized is added to the three-dimensional matrix X to form the three-dimensional matrix X1 to be recognized, and the vectorized label of the URL to be recognized is added to the label vector Y to form the label vector Y1 to be recognized).
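The threshold rule described above can be sketched as follows. The function name classify, the threshold value 0.5, and the example scores are all hypothetical; the text only states that a preset first threshold is used.

```python
def classify(h, threshold=0.5):
    """Return 'harmful' if the score h exceeds the threshold, else 'harmless'.

    The default threshold of 0.5 is an assumption for illustration only.
    """
    return "harmful" if h > threshold else "harmless"


# Hypothetical h(X1) values, e.g. taken from the (n, 1) output of model.predict.
scores = [0.93, 0.50, 0.12]
labels = [classify(h) for h in scores]
print(labels)
```

A score exactly equal to the threshold is treated as harmless, matching the "less than or equal to" condition in the text.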

When the website server receives requests online, each received URL can be judged in real time by the LSTM long short-term memory network model. The information in the keys of the URL parameter pieces is fully used, so the judgment is not easily discovered and bypassed by a requester, and both the efficiency and the accuracy of the judgment are improved. Because the LSTM long short-term memory network model is a deep learning model, it learns automatically and fits sequential data well; that is, it can use both the information of each parameter piece (key) and the relative order of the parameter pieces (key sequence). Therefore, the keys of the URL parameter pieces of harmful requests are converted into binary strings by a hash function, the binary strings are vectorized to form a three-dimensional matrix, and the three-dimensional matrix is used to train the LSTM model, so that the characteristics of harmful request URLs can be summarized.

The keys of the parameter pieces of the URL to be identified are likewise converted into binary strings by the hash function, and the binary strings are vectorized to form a three-dimensional matrix together with the trained harmful request URLs. This three-dimensional matrix is fed to the LSTM model, and a result is obtained based on what the model learned during training from the manually labeled harmful and harmless request URLs. From this result, the similarity between the URL to be identified and the harmful request URLs is judged, i.e., whether the keys of its parameter pieces are highly similar to those in the harmful request URLs (one key of the URL to be identified is similar to one key of a harmful request URL, and there are multiple such pairs of similar keys). This determines whether the URL request to be identified is a harmful or a harmless request, so that the URL is classified automatically without manual operation. Splitting the URLs of multiple harmful requests into parameter pieces, converting their keys into binary strings, and vectorizing the binary strings to form a three-dimensional matrix makes full use of the effective information contained in the strings of each URL. In particular, for URLs sent by different versions (for example, version 4.0.0 of the microblog client may send URL 1 while version 8.0.0 sends URL 2, and their parameters may be the same or different), the LSTM model can recognize the relative order of the parameter pieces (key sequence), so that keys of the same parameter pieces arranged in a different order can still be identified. This improves the recognition of URLs that are similar but not identical, and thus the probability of identifying harmful request URLs.
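To make the role of key order concrete, the sketch below vectorizes two URLs that carry the same parameter keys in a different order. It is an illustration under stated assumptions: SHA-256 from hashlib stands in for the unspecified hash function, the bits are taken from the raw digest, and the URLs, the key names, and the helpers param_keys, key_bits and url_to_matrix are all hypothetical.

```python
import hashlib
from urllib.parse import parse_qsl, urlsplit


def param_keys(url):
    """Return the query-parameter keys of a URL in the order they appear."""
    query = urlsplit(url).query
    return [key for key, _ in parse_qsl(query, keep_blank_values=True)]


def key_bits(key, k):
    """Hash a key (SHA-256, an assumed stand-in) and return its first k bits."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    bits = "".join(f"{byte:08b}" for byte in digest)
    return [int(b) for b in bits[:k]]


def url_to_matrix(url, k=8):
    """Build a (k, number_of_keys) matrix with one column per parameter key."""
    columns = [key_bits(key, k) for key in param_keys(url)]
    return [list(row) for row in zip(*columns)]


def columns_of(matrix):
    """Return the columns of a matrix as a list of lists."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical URLs: same keys, different order and different values.
url_a = "https://example.com/api?uid=1&token=abc&page=2"
url_b = "https://example.com/api?token=xyz&uid=9&page=7"

cols_a = columns_of(url_to_matrix(url_a))
cols_b = columns_of(url_to_matrix(url_b))
print(param_keys(url_a), param_keys(url_b))
print(sorted(cols_a) == sorted(cols_b))
```

Both URLs yield the same set of columns, because only the keys are hashed, but the columns appear in a different order. This ordering is the sequence information that the text says the LSTM model can exploit.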

It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.

In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".

Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.

The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.

In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium, and, thus, is included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., infrared, radio, and microwave. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.

The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A URL request classification method based on an LSTM model is characterized by comprising the following steps:

2. The LSTM model-based URL request classification method of claim 1, wherein vectorizing each URL to obtain a corresponding vector matrix respectively comprises:

3. The LSTM model-based URL request classification method of claim 2, wherein converting the key of each parameter piece into a binary string comprises:

4. The LSTM model-based URL request classification method of claim 2, wherein vectorizing each URL to obtain a corresponding vector matrix respectively, further comprises:

5. The LSTM model-based URL request classification method of claim 1, wherein the URL corresponding tag comprises: harmful requests, harmless requests and to-be-identified;

6. An apparatus for classifying URL requests based on LSTM model, comprising:

7. The LSTM model-based URL request classification device according to claim 6, wherein the vectorization unit includes a parameter piece segmentation transformation subunit, the parameter piece segmentation transformation subunit being configured to:

8. The LSTM model-based URL request classification device according to claim 7, wherein the parameter piece segmentation transformation subunit is specifically configured to:

9. The LSTM model-based URL request classification device of claim 7 wherein the vectorization unit further includes a warping subunit, the warping subunit being specifically configured to:

10. The LSTM model-based URL request classification device of claim 6 wherein the URL mapping tag comprises: harmful requests, harmless requests and to-be-identified;

the vectorization unit is specifically configured to: and if the label corresponding to the URL is a harmless request, setting the label vectorization to be 0, if the label corresponding to the URL is a harmful request, setting the label vectorization to be 1, and if the label corresponding to the URL is to be identified, setting the label vectorization to be an unknown number y.